CUDA

A C/C++/Fortran parallel computing platform and application programming interface (API) that allows software to use graphics processing units (GPUs) for general-purpose processing.

Version(s): 11.8 and 12.1

CUDA requires an environment module

In order to use CUDA, you must first load the appropriate environment module:

module load cuda

Warning

Due to disk space constraints, NVIDIA CUDA libraries are available only on the login nodes and GPU nodes. They are not available on general-purpose compute nodes. Be sure to specify the Slurm --gres=gpu:[1-4] option when submitting jobs to the cluster.

Compiling with CUDA

Once you have loaded the CUDA module (module load cuda), the nvcc command will be available:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

You can now use the nvcc command to compile C/C++ code:

$ nvcc -O3 -arch sm_61 -o a.out a.cu

In the above example, the compiler option -arch sm_61 specifies the compute capability 6.1 for the Pascal micro-architecture.
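For readers who do not yet have an a.cu to compile, the following is a minimal sketch of what such a file might contain. It is an illustration rather than a file shipped with the cluster: the kernel name vec_add, the helper blocks_for, the array size, and the use of managed memory are all assumptions chosen to keep the example short.

```cuda
// a.cu: hypothetical minimal example; all names and sizes are illustrative
#include <stdio.h>
#include <cuda_runtime.h>

// Number of blocks needed so that blocks * threads >= n (ceiling division)
static int blocks_for(int n, int threads) {
    return (n + threads - 1) / threads;
}

// Each thread adds one pair of elements
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;

    // Managed memory is accessible from both host and device,
    // which avoids explicit cudaMemcpy calls in this sketch
    if (cudaMallocManaged(&a, bytes) != cudaSuccess ||
        cudaMallocManaged(&b, bytes) != cudaSuccess ||
        cudaMallocManaged(&c, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    for (int i = 0; i < n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    const int threads = 256;
    vec_add<<<blocks_for(n, threads), threads>>>(a, b, c, n);

    // Check for launch errors first, then wait for the kernel to finish
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("c[0] = %.1f, c[n-1] = %.1f\n", c[0], c[n - 1]);
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```

If the -arch value does not match the GPU the job lands on, the launch check in this sketch is where the mismatch would typically surface, so it is worth keeping even in small test programs.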

Submit CUDA jobs

CUDA jobs are similar to regular HPC jobs, with two additional considerations:

You need to request GPU resources from the scheduler with the --gres=gpu:1 option.
You need to load the CUDA module (module load cuda).

Below is an example job submit script for a CUDA job:

#!/bin/bash

#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --account genacc_q
#SBATCH --gres=gpu:1  # Ensure your job is scheduled on a node with GPU resources (1)

# Load CUDA module libraries
module load cuda # Load the CUDA libraries into the environment (2)

# Execute your CUDA code
srun -n 1 ./my_cuda_code < input.dat > output.txt

(1) You can change the number of GPUs you request by changing the number at the end of the option (e.g. --gres=gpu:2). Most general access nodes have a maximum of two GPU cards. Owner accounts may have more.
(2) If you need a specific version of the CUDA libraries, run $ module avail cuda on any node to see which versions are available, then update this line accordingly (e.g. module load cuda/11.8).
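To confirm how many GPUs a job actually received, a short program can enumerate the devices visible to it. The sketch below assumes the scheduler exposes only the allocated GPUs to the job, which is the usual Slurm behaviour but is not stated on this page. The file name list_gpus.cu and the helper format_arch_flag are hypothetical.

```cuda
// list_gpus.cu: hypothetical file name; a sketch, not a cluster-provided tool
#include <stdio.h>
#include <cuda_runtime.h>

// Build the nvcc architecture name (for example "sm_61")
// from a compute capability such as 6.1
static void format_arch_flag(int major, int minor, char *buf, size_t len) {
    snprintf(buf, len, "sm_%d%d", major, minor);
}

int main(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // Typically seen on nodes without a GPU or without the driver
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("visible GPUs: %d\n", count);

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            continue;  // skip devices that cannot be queried
        }
        char arch[16];
        format_arch_flag(prop.major, prop.minor, arch, sizeof(arch));
        printf("device %d: %s, compute capability %d.%d (-arch %s)\n",
               dev, prop.name, prop.major, prop.minor, arch);
    }
    return 0;
}
```

Run inside a job submitted with --gres=gpu:2, this would be expected to report two devices. The printed -arch value can then be passed to nvcc when compiling for that node type.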

CUDA Example

The following CUDA code example can help new users get familiar with the GPU resources available in the HPC cluster.

Create a file called deviceQuery.cu:

    #include <stdio.h>
    #include <cuda_runtime.h>
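    int main( ) {
        int dev = 0;
        cudaDeviceProp prop;
        // Query device 0 and stop early if no usable GPU is visible
        cudaError_t err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
            return 1;
        }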
        printf("device id %d, name %s\n", dev, prop.name);
        printf("number of multi-processors = %d\n", 
            prop.multiProcessorCount);
        printf("Total constant memory: %4.2f kb\n", 
            prop.totalConstMem/1024.0);
        printf("Shared memory per block: %4.2f kb\n",
            prop.sharedMemPerBlock/1024.0);
        printf("Total registers per block: %d\n", 
            prop.regsPerBlock);
        printf("Maximum threads per block: %d\n", 
            prop.maxThreadsPerBlock);
        printf("Maximum threads per multi-processor: %d\n", 
            prop.maxThreadsPerMultiProcessor);
        printf("Maximum number of warps per multi-processor %d\n",
            prop.maxThreadsPerMultiProcessor/32);
        return 0;
    }

Compile the code:

$ module load cuda
$ nvcc -o deviceQuery deviceQuery.cu

Create the job submit script (for example, gpu_test.sh):

#!/bin/bash

#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --account backfill
#SBATCH -t 05:00
#SBATCH --gres=gpu:2 
#SBATCH --mail-type=ALL

# Load CUDA module libraries
module load cuda

# Execute your CUDA code
srun -n 1 ./deviceQuery

Submit the job:

$ sbatch gpu_test.sh

Wait for the job to finish running. When it finishes, the output should look something like the following:

device id 0, name GeForce GTX 1080 Ti
number of multi-processors = 28
Total constant memory: 64.00 kb
Shared memory per block: 48.00 kb
Total registers per block: 65536
Maximum threads per block: 1024
Maximum threads per multi-processor: 2048
Maximum number of warps per multi-processor 64
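The last line of this output is a derived value: deviceQuery.cu divides the maximum threads per multi-processor by a hard-coded warp size of 32, so 2048 / 32 = 64. As a hedged variation, the sketch below reads the warp size from the device instead and also reports the total number of resident threads, which for the sample output would be 28 x 2048 = 57344. The helper name warps_per_sm is hypothetical.

```cuda
// Variation on deviceQuery.cu; warps_per_sm is an illustrative helper
#include <stdio.h>
#include <cuda_runtime.h>

// Number of whole warps that fit in the given thread count
static int warps_per_sm(int max_threads_per_sm, int warp_size) {
    return max_threads_per_sm / warp_size;
}

int main(void) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // warpSize comes from the device rather than being assumed to be 32
    printf("Warp size: %d\n", prop.warpSize);
    printf("Maximum number of warps per multi-processor %d\n",
           warps_per_sm(prop.maxThreadsPerMultiProcessor, prop.warpSize));

    // Upper bound on threads resident on the whole device at once
    printf("Maximum resident threads on the device: %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

These limits are useful when choosing block sizes: a block size that is a multiple of the warp size avoids partially filled warps.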