

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) for C, C++, and Fortran that allows software to use graphics processing units (GPUs) for general-purpose processing.


Due to disk space constraints, NVIDIA CUDA libraries are available only on the login nodes and GPU nodes. They are not available on general-purpose compute nodes. Be sure to specify the Slurm --gres=gpu:[1-4] option when submitting jobs to the cluster.

Compiling with CUDA#

Once you have loaded the CUDA module (module load cuda), the nvcc command will be available:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

You can now use the nvcc command to compile C/C++ code:

$ nvcc -O3 -arch sm_61 -o a.out

In the above example, the compiler option -arch sm_61 specifies the compute capability 6.1 for the Pascal micro-architecture.
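The -arch value is the compute capability with the dot removed and an sm_ prefix added. As a minimal sketch, the helper below converts a capability string such as 6.1 into the matching flag value. Note that arch_flag is a hypothetical name used for illustration only; it is not part of CUDA or of the cluster environment.

```shell
# arch_flag: hypothetical helper (not part of CUDA) that turns a compute
# capability such as "6.1" into the matching nvcc -arch value "sm_61".
arch_flag() {
    case "$1" in
        # Reject empty input, stray characters, extra dots, or a leading/trailing dot
        ""|*[!0-9.]*|*.*.*|.*|*.)
            echo "usage: arch_flag MAJOR.MINOR" >&2
            return 1 ;;
        # Exactly one dot with digits on both sides: accepted
        *.*) ;;
        # No dot at all (e.g. "61"): rejected
        *)
            echo "usage: arch_flag MAJOR.MINOR" >&2
            return 1 ;;
    esac
    # Join the major and minor parts and add the sm_ prefix
    echo "sm_${1%%.*}${1#*.}"
}

arch_flag 6.1   # prints sm_61
```

The result can then be passed to the compiler, for example with -arch "$(arch_flag 6.1)". Which capability applies depends on the GPU model installed in the node you target.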

Submit CUDA jobs#

CUDA jobs are similar to regular HPC jobs, with two additional considerations:

  1. You need to request GPU resources from the scheduler with the --gres=gpu:1 option.
  2. You need to load the CUDA module (module load cuda).

Below is an example job submit script for a CUDA job:


#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --account genacc_q
#SBATCH --gres=gpu:1  # Ensure your job is scheduled on a node with GPU resources (1)

# Load CUDA module libraries
module load cuda # Load the CUDA libraries into the environment (2)

# Execute your CUDA code
srun -n 1 ./my_cuda_code < input.dat > output.txt

  1. You can change the number of GPUs you request by changing the trailing number (e.g. --gres=gpu:2). Most general access nodes have at most two GPU cards; owner accounts may have more.
  2. If you need a specific version of the CUDA libraries, run module avail cuda on any node to see which versions are available, then update this line accordingly (e.g. module load cuda/10.1).
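
Inside a running job, it can be useful to confirm how many GPUs were actually allocated before launching your code. The sketch below assumes that Slurm exports the allocated devices as a comma-separated list in the CUDA_VISIBLE_DEVICES environment variable. This is common for --gres=gpu jobs but depends on how the cluster is configured, so treat it as an assumption. The name count_gpus is hypothetical.

```shell
# count_gpus: hypothetical helper that counts the entries in a
# comma-separated device list such as "0,1".
count_gpus() {
    list="$1"
    # An empty list means no devices are visible
    if [ -z "$list" ]; then
        echo 0
        return 0
    fi
    # Split the list on commas and count the resulting fields
    old_ifs="$IFS"
    IFS=','
    set -- $list
    IFS="$old_ifs"
    echo "$#"
}

# Assumption: Slurm sets CUDA_VISIBLE_DEVICES for jobs that request GPUs
echo "GPUs visible to this job: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```

Placed in a job script after the module load line, this prints the GPU count to the job's output file, which makes a mismatch with the requested --gres value easy to spot.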

CUDA Example#

The following CUDA code example can help new users get familiar with the GPU resources available in the HPC cluster.

Create a file called

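    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int dev = 0;  /* query the first GPU visible to the job */
        cudaDeviceProp prop;

        /* Fill prop with the properties of device dev; stop on failure */
        cudaError_t err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        printf("device id %d, name %s\n", dev, prop.name);
        printf("number of multi-processors = %d\n", prop.multiProcessorCount);
        printf("Total constant memory: %4.2f kb\n", prop.totalConstMem / 1024.0);
        printf("Shared memory per block: %4.2f kb\n", prop.sharedMemPerBlock / 1024.0);
        printf("Total registers per block: %d\n", prop.regsPerBlock);
        printf("Maximum threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Maximum threads per multi-processor: %d\n", prop.maxThreadsPerMultiProcessor);
        /* Warps per multi-processor = threads per multi-processor / warp size */
        printf("Maximum number of warps per multi-processor %d\n", prop.maxThreadsPerMultiProcessor / prop.warpSize);
        return 0;
    }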

Compile the code:

$ module load cuda
$ nvcc -o deviceQuery

Create the job submit script ( or some-such):


#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --account backfill
#SBATCH -t 05:00
#SBATCH --gres=gpu:2 
#SBATCH --mail-type=ALL

# Load CUDA module libraries
module load cuda

# Execute your CUDA code
srun -n 1 ./deviceQuery

Submit the job:

$ sbatch

Wait for the job to finish running. When it finishes, the output should look something like the following:

device id 0, name GeForce GTX 1080 Ti
number of multi-processors = 28
Total constant memory: 64.00 kb
Shared memory per block: 48.00 kb
Total registers per block: 65536
Maximum threads per block: 1024
Maximum threads per multi-processor: 2048
Maximum number of warps per multi-processor 64