Using GPU

Hardware

Currently there are two pure GPU nodes; the OS on them is Ubuntu 20.04:

Cluster  Nodes  GPU               #gpus  CPU               #cores  Memory  InfiniBand
gpu      1      Nvidia V100 32GB  7      Xeon Gold 5118    24      128GB   Mellanox 100Gb
gpu      1      Nvidia A100 80GB  2      Xeon Silver 4310  24      256GB   Mellanox 100Gb

There are also two mixed CPU/GPU nodes; the OS on them is AlmaLinux 9. Each node reserves 32 CPU cores for GPU jobs, leaving 160 CPU cores for pure CPU jobs.

Cluster  Nodes  GPU              #gpus  CPU            #cores  Memory  InfiniBand
gpu      2      Nvidia L20 48GB  2      AMD EPYC 9654  192     768GB   Mellanox 100Gb

To use a GPU, include the request_GPUs directive in the Condor job file. Here is an example:

Universe   = vanilla
Executable = exampleA
Arguments  = 1000
Log        = exampleA.log
Output     = exampleA.out
Error      = exampleA.error
request_CPUs = 1
request_GPUs = 1
+SJTU_GPUModel = "V100_32G"  # optional, V100 32G is the default GPU

# uncomment to request an A100 GPU
#+SJTU_GPUModel = "A100_80G"

# uncomment to request an L20 GPU
#+SJTU_GPUModel = "L20_48G"

should_transfer_files = NO
Queue

Condor sends GPU jobs to the Nvidia V100 GPUs by default. If the +SJTU_GPUModel directive is used, the job will be sent to the specified GPU model.
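
To double-check which GPU model a job actually landed on, you can query nvidia-smi from inside the job. Below is a minimal Python sketch; it assumes nvidia-smi is on the job's PATH, and note that it lists every GPU installed on the node, not only the one assigned to the job:

import subprocess

# List the GPU model(s) and memory installed on the node this job runs on.
# Note: nvidia-smi enumerates all GPUs on the node, not only the assigned one.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True,
    text=True,
)
print(result.stdout)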

One Nvidia V100 GPU is reserved for testing purposes, and it runs jobs for a maximum of one hour. To request this GPU, set the +SJTU_GPUModel directive to V100_32G_Test:

Universe   = vanilla
Executable = exampleA
Arguments  = 1000
Log        = exampleA.log
Output     = exampleA.out
Error      = exampleA.error
# request no more than 4 CPUs
request_CPUs = 1
request_GPUs = 1
+SJTU_GPUModel = "V100_32G_Test"

should_transfer_files = NO
Queue

Do not use CUDA_VISIBLE_DEVICES

CUDA_VISIBLE_DEVICES is an environment variable used by the CUDA API to restrict the devices a GPU application can see. It is how HTCondor allocates GPUs to users.

Occasionally, users write and test code on other machines and do not notice that CUDA_VISIBLE_DEVICES is hidden in the code. It can be an export CUDA_VISIBLE_DEVICES=i,j… in a shell script, or an os.environ["CUDA_VISIBLE_DEVICES"] = "i,j…" buried deep in Python code. In either case, it overrides HTCondor's assignment and causes the program to run on GPUs not assigned to it.

The GPU nodes find and kill illegal processes running on GPUs not assigned by HTCondor. If your GPU job starts fine but disappears "mysteriously" after a few minutes, look for CUDA_VISIBLE_DEVICES in your code.
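
If you are unsure whether your code touches this variable, print it at the start of your job instead of setting it. The Python sketch below is illustrative: it only reads the value HTCondor assigned, and shows, commented out, the kind of assignment that must not appear in cluster jobs.

import os

# Read the GPU(s) HTCondor assigned to this job's slot; never overwrite it.
assigned = os.environ.get("CUDA_VISIBLE_DEVICES")
print("GPU(s) assigned by HTCondor:", assigned)

# Do NOT do this on the cluster -- it overrides HTCondor's assignment and the
# process will be killed for running on a GPU it was not given:
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"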

CUDA Toolkit

As of February 2026, the default CUDA Toolkit is version 11.5 on the V100 node, 11.6 on the A100 node, and 13.1 on the L20 nodes.
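
If a job needs to confirm which toolkit it actually sees at run time, it can query nvcc. This is a minimal Python sketch and assumes nvcc from the node's default toolkit is on the job's PATH:

import subprocess

# Print the CUDA Toolkit (nvcc) version available on the node the job landed on.
result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
print(result.stdout)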