Using GPU

Hardware

Currently there are two GPU nodes:

Cluster   Nodes   GPU                #gpus   CPU                #cores   Memory   InfiniBand
gpu       1       Nvidia V100 32GB   7       Xeon Gold 5118     24       128GB    Mellanox 100Gb
gpu       1       Nvidia A100 80GB   2       Xeon Silver 4310   24       256GB    Mellanox 100Gb

To use a GPU, include the request_GPUs directive in the HTCondor job file. Here is an example:

Universe   = vanilla
Executable = exampleA
Arguments  = 1000
Log        = exampleA.log
Output     = exampleA.out
Error      = exampleA.error
request_CPUs = 1
request_GPUs = 1
# Optional: the V100 32G is the default GPU, so this line may be omitted
+SJTU_GPUModel = "V100_32G"

should_transfer_files = NO
Queue
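
The exampleA executable above is just a placeholder. As a hypothetical illustration only (the script name check_gpu.py and its contents are not part of the cluster setup, and it assumes Python 3 and nvidia-smi are available on the node), a minimal GPU executable might simply report which GPU HTCondor made visible to the job:

#!/usr/bin/env python3
# check_gpu.py -- hypothetical example of a GPU job executable.
import os
import subprocess

# HTCondor exports CUDA_VISIBLE_DEVICES so that CUDA applications in this
# job only see the GPU(s) assigned to it. Read it, but never change it.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

# List all GPUs installed on the node; the assigned device(s) above are a
# subset of this list.
subprocess.run(["nvidia-smi", "-L"], check=True)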

HTCondor sends GPU jobs to the Nvidia V100 node by default. To request that a job run on an A100 GPU, add the +SJTU_GPUModel directive. A job file for the A100 GPU looks like this:

Universe   = vanilla
Executable = exampleA
Arguments  = 1000
Log        = exampleA.log
Output     = exampleA.out
Error      = exampleA.error
request_CPUs = 1
request_GPUs = 1
+SJTU_GPUModel = "A100_80G"

should_transfer_files = NO
Queue

One Nvidia V100 GPU is reserved for testing purposes, and it can run a job for a maximum of one hour. To request this GPU, you can use the +SJTU_GPUModel directive:

Universe   = vanilla
Executable = exampleA
Arguments  = 1000
Log        = exampleA.log
Output     = exampleA.out
Error      = exampleA.error
# Request no more than 4 CPUs on the test GPU
request_CPUs = 1
request_GPUs = 1
+SJTU_GPUModel = "V100_32G_Test"

should_transfer_files = NO
Queue

Do not use CUDA_VISIBLE_DEVICES

CUDA_VISIBLE_DEVICES is an environment variable recognized by CUDA that restricts which devices a GPU application can see. It is how HTCondor allocates GPUs to users.

Occasionally, users write and test code on other machines and do not notice that a CUDA_VISIBLE_DEVICES setting is hidden in the code. It may appear as export CUDA_VISIBLE_DEVICES=i,j… in a shell script, or as os.environ["CUDA_VISIBLE_DEVICES"] = "i,j…" buried deep in Python code. In either case it overrides HTCondor's assignment and causes the program to run on GPUs that were not assigned to it.
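
As an illustration (the device indices below are made up), this is the kind of line to remove; under HTCondor the program should leave the variable alone and simply use whatever device is already visible:

import os

# WRONG: hard-coding devices overrides the GPU that HTCondor assigned.
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# RIGHT: leave CUDA_VISIBLE_DEVICES untouched; read it only if you need
# to log which GPU the job received.
print("Assigned GPU(s):", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))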

The GPU nodes find and kill processes running on GPUs that HTCondor did not assign to them. If your GPU job starts fine but disappears "mysteriously" after a few minutes, look for CUDA_VISIBLE_DEVICES in your code.

CUDA Toolkit

As of August 2022, the default CUDA Toolkit on the V100 node is version 11.5; on the A100 node it is CUDA 11.6.
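
To double-check which toolkit a job actually sees, a minimal sketch (assuming nvcc is on the PATH of the GPU node) is:

import subprocess

# Print the version of the CUDA Toolkit installed on the node the job landed on.
subprocess.run(["nvcc", "--version"], check=True)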