DistributedDataParallel: RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

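Based on the output of nvidia-smi, it seems the GPUs are in Exclusive Process compute mode, which allows only a single CUDA context per device. Any additional process that tries to use the same GPU would then fail with this error.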

nvidia-smi -i 0 -c 0
nvidia-smi -i 1 -c 0
# or, without -i, for all GPUs at once
nvidia-smi -c 0

These commands should reset both GPUs to the default compute mode again. Changing the compute mode usually requires root privileges.
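
To confirm which GPUs are still in a non-default compute mode before launching the DDP job, something like the following rough sketch could help. It assumes that `nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader` prints lines such as `0, Exclusive_Process`. The helper names (`parse_compute_modes`, `non_default_gpus`, `query_compute_modes`) are made up for illustration and are not part of PyTorch or any NVIDIA library.

```python
import subprocess


def parse_compute_modes(csv_text):
    """Parse 'index, compute_mode' lines into a {gpu_index: mode} dict.

    Assumes the csv,noheader output format of nvidia-smi, e.g.
    '0, Exclusive_Process'.
    """
    modes = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def non_default_gpus(modes):
    """Return the sorted indices of GPUs whose compute mode is not 'Default'."""
    return sorted(i for i, mode in modes.items() if mode != "Default")


def query_compute_modes():
    """Run nvidia-smi (must be on PATH) and return the parsed compute modes."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_modes(result.stdout)
```

On a machine with the NVIDIA driver installed, `non_default_gpus(query_compute_modes())` should list the GPU indices that still need `nvidia-smi -i <index> -c 0`.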
