"RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable" with DistributedDataParallel

Hello all,

I have a training script that I am trying to run on a shared university cluster. It runs fine on a single GPU in one process, and it also runs fine on multiple GPUs in one process (DataParallel). But as soon as I switch to DistributedDataParallel, with the worker processes started via torch.multiprocessing.spawn, the spawned processes raise the following exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ynagano/.local/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ynagano/cdr3encoding/pretrain.py", line 308, in train
    model = DistributedDataParallel(model, device_ids=[device])
  File "/home/ynagano/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
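
For reference, here is a minimal sketch of the kind of setup described above: one process per GPU started with torch.multiprocessing.spawn, each wrapping its model in DistributedDataParallel. This is not the actual pretrain.py. The train and launch names, the stand-in Linear model, and the address/port values are all assumptions for illustration. The sketch calls torch.cuda.set_device(rank) before any other CUDA work so that each process only touches its own GPU. That is a common precaution, but whether it avoids the error above is untested here.

```python
import os


def train(rank, world_size):
    """Per-process entry point; mp.spawn passes the process index as `rank`."""
    # torch is imported inside the function so this module imports cheaply.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    # Rendezvous settings for a single node (assumed values).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # Make this rank's GPU the current device before any other CUDA call.
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    try:
        model = torch.nn.Linear(8, 2).to(rank)  # stand-in for the real model
        model = DistributedDataParallel(model, device_ids=[rank])
        # ... the training loop would go here ...
    finally:
        dist.destroy_process_group()


def launch():
    """Start one training process per visible GPU."""
    import torch
    import torch.multiprocessing as mp

    world_size = torch.cuda.device_count()
    # mp.spawn calls train(i, world_size) for i in range(nprocs).
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```

launch() would be called under an if __name__ == "__main__": guard, which the spawn start method requires.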

I have run an interactive Python interpreter in the same environment and verified that each of the supposedly available CUDA devices can be used normally (for example, by creating a tensor and moving it onto the device with something like sometensor.to(0)). So it doesn’t make sense to me that the processes complain about all CUDA devices being unavailable.
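
The interactive check described above can be sketched as a small function. The name probe_devices is hypothetical, and the code assumes the same torch build used for training is importable. Note that it runs in a single process, so it does not reproduce the multi-process situation that DistributedDataParallel creates.

```python
def probe_devices():
    """Try a tiny allocation on every visible GPU; return {index: status}."""
    import torch  # assumed to be the same build used for training

    results = {}
    for index in range(torch.cuda.device_count()):
        try:
            torch.ones(1).to(index)        # same idea as sometensor.to(0)
            torch.cuda.synchronize(index)  # surface asynchronous errors here
            results[index] = "ok"
        except RuntimeError as exc:
            results[index] = f"failed: {exc}"
    return results
```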

Here is the output of nvidia-smi in the same environment. It shows both cards idle, with no memory in use and no processes running on them:

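Sun Feb 20 11:33:26 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:3B:00.0 Off |                  N/A |
| 27%   26C    P8     1W / 250W |      0MiB / 11019MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:B1:00.0 Off |                  N/A |
| 27%   27C    P8     6W / 250W |      0MiB / 11019MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+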
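
The Compute M. column above reads E. Process, which is nvidia-smi's abbreviation for the exclusive-process compute mode. In that mode a GPU accepts a CUDA context from only one process at a time. Whether that is related to the error is not established in this post, but the mode can be read programmatically. The sketch below is a hedged example: the function names are hypothetical, and the exact strings nvidia-smi prints for compute_mode (such as Default or Exclusive_Process) are assumed from typical output.

```python
import subprocess

# Name is queried last so that a comma inside a GPU name cannot break parsing.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,compute_mode,name",
    "--format=csv,noheader",
]


def parse_compute_modes(csv_text):
    """Parse 'index, compute_mode, name' rows into {index: (mode, name)}."""
    modes = {}
    for line in csv_text.strip().splitlines():
        index, mode, name = (field.strip() for field in line.split(",", 2))
        modes[int(index)] = (mode, name)
    return modes


def query_compute_modes():
    """Run nvidia-smi and return the compute mode of every GPU it reports."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_modes(result.stdout)
```

On a node like the one above, query_compute_modes() would be expected to report an exclusive-process mode for both GPUs.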

I am using:

  • Python 3.9.5
  • PyTorch 1.10.1+cu113

Any help would be appreciated. Thank you!

Update: I have retried running my code on:

  • Python 3.8.5
  • PyTorch 1.10.1 (CUDA 10.2)

With this setup everything works as expected. I suspect the error I got is specific to the torch package built against the CUDA 11.3 backend.
