Hello all,
I have a training script that I am attempting to run on a shared university cluster. The training loop runs fine on a single GPU using one process, and it also runs fine on multiple GPUs using one process (DataParallel). But the moment I try to run it with DistributedDataParallel, spawning one worker process per GPU via torch.multiprocessing.spawn, the spawned processes raise an exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/ynagano/.local/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/ynagano/cdr3encoding/pretrain.py", line 308, in train
model = DistributedDataParallel(model, device_ids=[device])
File "/home/ynagano/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
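For reference, the spawn/DDP pattern I am using looks roughly like this (a minimal sketch only; the model, the sizes, and the backend choice here are stand-ins, not my actual pretrain.py):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel

WORLD_SIZE = 2  # one process per GPU


def train(rank: int, world_size: int) -> None:
    # Rendezvous settings for the default TCP store.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(8, 8).to(rank)  # stand-in for the real model
    # This is the line that raises in my script:
    model = DistributedDataParallel(model, device_ids=[rank])

    # ... training loop ...
    dist.destroy_process_group()


if __name__ == "__main__" and torch.cuda.device_count() >= WORLD_SIZE:
    mp.spawn(train, args=(WORLD_SIZE,), nprocs=WORLD_SIZE)
```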
I have run an interactive Python interpreter in the same environment and verified that each of the supposedly available CUDA devices can be used normally (e.g. by creating a tensor and moving it onto a device with something like some_tensor.to(0)). So it doesn't make sense to me that the processes complain that all CUDA devices are unavailable.
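The sanity check I ran looks roughly like this (it simply moves a small tensor onto every visible device):

```python
import torch

# Try to place a tensor on each visible CUDA device in turn.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        t = torch.ones(1).to(i)
        print(f"device {i}: OK ({t.device})")
else:
    print("no CUDA devices visible")
```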
Here is the output of nvidia-smi in the environment in question. It shows that both cards are idle, with no memory allocated and no processes running on them:
Sun Feb 20 11:33:26 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:3B:00.0 Off | N/A |
| 27% 26C P8 1W / 250W | 0MiB / 11019MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... On | 00000000:B1:00.0 Off | N/A |
| 27% 27C P8 6W / 250W | 0MiB / 11019MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
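In case it is relevant, the "Compute M." column above can also be queried directly. A small sketch (assumes nvidia-smi is on PATH; returns an empty list otherwise):

```python
import shutil
import subprocess


def gpu_compute_modes() -> list:
    """Return the compute mode reported by nvidia-smi for each GPU."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]


print(gpu_compute_modes())
```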
I am using:
- Python 3.9.5
- PyTorch 1.10.1+cu113
Any help would be appreciated. Thank you!
Yuta