DistributedDataParallel: RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

Hello, I get the error RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable when I try to run my code on server A, which has 2 GPUs, while the same code runs fine on another server B. What could be causing this issue? Could it be related to the CUDA version? Server A, where the code fails, has CUDA 10.1, while server B, where the code runs, has CUDA 11.

The full error (the process freezes after this output):

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/cluster/home/klugh/software/anaconda/envs/temp/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/cluster/home/klugh/software/anaconda/envs/temp/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/cluster/home/klugh/software/anaconda/envs/temp/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 109, in rebuild_cuda_tensor
    storage = storage_cls._new_shared_cuda(
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

Are you using a shared cluster/machine by any chance? The GPU may not be available if another application/user has taken control of it. You can check current GPU usage using the nvidia-smi command.
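
If you want the same information programmatically (e.g. from inside a job script), a small sketch along these lines should work; it assumes the pynvml NVML bindings are installed, which are a separate package and not part of PyTorch itself:

import pynvml

# query each visible GPU for memory usage and running compute processes
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    print(f"GPU {i}: {mem.used / 1024**2:.0f} MiB used, "
          f"{len(procs)} compute process(es) running")
pynvml.nvmlShutdown()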

Hi @rvarm1, thanks for the answer! I am indeed using a shared cluster, but when I run nvidia-smi the GPUs seem to be free:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   37C    P0    44W / 300W |      0MiB / 32480MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   35C    P0    43W / 300W |      0MiB / 32480MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Hey @ngimel @ptrblck, have you seen errors like this before?

Based on the output of nvidia-smi ("E. Process" in the Compute M. column), it seems the GPUs are set to EXCLUSIVE_PROCESS mode, which allows only a single CUDA context per device.

nvidia-smi -i 0 -c 0   # set GPU 0 back to the DEFAULT compute mode
nvidia-smi -i 1 -c 0   # set GPU 1 back to the DEFAULT compute mode
# or for both directly
nvidia-smi -c 0

should reset both GPUs to the default mode again.
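
If you'd like to confirm the compute mode programmatically before and after the reset, a minimal sketch using the pynvml NVML bindings (again an extra dependency, not part of PyTorch) could look like this:

import pynvml

# readable names for the NVML compute mode constants
MODES = {
    pynvml.NVML_COMPUTEMODE_DEFAULT: "Default",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_THREAD: "Exclusive Thread (deprecated)",
    pynvml.NVML_COMPUTEMODE_PROHIBITED: "Prohibited",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: "Exclusive Process",
}

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mode = pynvml.nvmlDeviceGetComputeMode(handle)
    print(f"GPU {i}: compute mode = {MODES.get(mode, mode)}")
pynvml.nvmlShutdown()

Note that changing the compute mode with nvidia-smi -c usually requires administrator privileges, so on a shared cluster you might need to ask the admins.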


@ptrblck could you expand on what exclusive process vs. shared mode entails? As far as I understand, it is common practice in compute clusters to have the GPUs set to “exclusive process” mode, and that is not changeable by a regular user.
How does PyTorch behave in the distributed case as opposed to the regular, single-process case?

The exclusive mode might be the right choice for your compute cluster, and you can stick with it if it's working.
However, I would not recommend it as the default mode on a local workstation if you are unsure about its limitations (only a single context can be created per device).

The recommended approach is to use DistributedDataParallel with a single process per GPU.
Each process then creates its own context on its own device, so every GPU only ever hosts a single context.
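
A minimal sketch of that setup, assuming 2 GPUs on a single node (the model, port, and tensor shapes are just placeholders):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    # bind this process to a single GPU *before* creating any CUDA context,
    # so each device only ever sees one process (works with EXCLUSIVE_PROCESS)
    torch.cuda.set_device(rank)
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = nn.Linear(10, 10).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # create (or load) tensors inside each worker instead of building CUDA
    # tensors in the parent process and sending them to the children, which
    # is the rebuild_cuda_tensor path shown in the traceback above
    inp = torch.randn(20, 10, device=f"cuda:{rank}")
    ddp_model(inp).sum().backward()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 2 on the node in question
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)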