Hello, I get the error RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable when I run my code on server A, which has 2 GPUs, while the same code runs fine on server B. What could be causing this? Could it be related to the CUDA version? Server A, where the code fails, has CUDA 10.1, while server B, where it runs, has CUDA 11.
The full error (the process freezes after this output):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/cluster/home/klugh/software/anaconda/envs/temp/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/cluster/home/klugh/software/anaconda/envs/temp/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/cluster/home/klugh/software/anaconda/envs/temp/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 109, in rebuild_cuda_tensor
    storage = storage_cls._new_shared_cuda(
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
Are you using a shared cluster/machine by any chance? The GPU may not be available if another application/user has taken control of it. You can check current GPU usage using the nvidia-smi command.
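In case it helps, here is a minimal sketch of how you could read each GPU's compute mode from Python by wrapping nvidia-smi. The helper names (parse_compute_modes, query_compute_modes) are made up for this example, and the sample values in the usage note are illustrative, not output from either of your servers. It assumes nvidia-smi is on the PATH and returns an empty list otherwise.

```python
import csv
import io
import shutil
import subprocess

# nvidia-smi query that prints one CSV row per GPU: index, name, compute mode.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,compute_mode",
    "--format=csv,noheader",
]


def parse_compute_modes(csv_text):
    """Parse 'index, name, compute_mode' rows into a list of dicts."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [
        {
            "index": int(row[0]),
            "name": row[1].strip(),
            "compute_mode": row[2].strip(),
        }
        for row in rows
        if len(row) == 3  # skip blank or malformed lines
    ]


def query_compute_modes():
    """Run nvidia-smi and return the parsed rows, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_compute_modes(result.stdout)


if __name__ == "__main__":
    for gpu in query_compute_modes():
        print(f"GPU {gpu['index']} ({gpu['name']}): {gpu['compute_mode']}")
```

If a GPU reports an exclusive process mode and another process already holds a context on it, a second process trying to use that GPU would be expected to fail with an error like the one above.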
@ptrblck could you expand on what exclusive process vs shared mode entails? As far as I understand it’s a common practice in compute clusters to have the GPU set up in “exclusive process” mode and that’s not changeable by a user.
How does PyTorch work when doing distributed work, as opposed to the regular case?
Exclusive process mode might be the right choice for your compute cluster, and you can stick with it if it works for you.
However, I would not recommend it as the default mode on a local workstation if you are unsure about its limitations (only a single process can create a CUDA context on each GPU).
The recommended approach is to use DistributedDataParallel with a single process per GPU.
Each process would thus create its own context on its own device.
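A rough sketch of the one-process-per-GPU pattern, to make this concrete. The names run_worker and launch, the toy Linear model, the random data, and the default address/port are placeholders I picked for illustration, not anything from your code. It falls back to the gloo backend on CPU when no GPU is visible, so it can be tried on any machine.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def run_worker(rank, world_size):
    """Run one training step in a process that owns exactly one device."""
    # Rendezvous settings; the defaults are placeholders for a single node.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    use_cuda = torch.cuda.is_available()
    dist.init_process_group(
        "nccl" if use_cuda else "gloo", rank=rank, world_size=world_size
    )
    try:
        if use_cuda:
            # Bind this process to a single GPU so it only creates a context there.
            torch.cuda.set_device(rank)
            device = torch.device("cuda", rank)
        else:
            device = torch.device("cpu")

        model = torch.nn.Linear(10, 1).to(device)  # toy model
        ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        inputs = torch.randn(8, 10, device=device)  # random stand-in data
        targets = torch.randn(8, 1, device=device)
        loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)

        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across processes here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


def launch():
    """Spawn one worker process per visible GPU (or one CPU process)."""
    world_size = max(torch.cuda.device_count(), 1)
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)
```

Call launch() from inside an if __name__ == "__main__": guard, since mp.spawn re-imports the main module in each child process. Because each process touches only its own GPU, this layout should also be compatible with exclusive process mode, as long as no other job already holds a context on those devices.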