SLURM cluster CUDA error: all CUDA-capable devices are busy or unavailable

I’m using a pre-trained Inception network to get some GAN metrics, and I have the following code to move the model to a GPU for evaluation:

def __init__(self, ...):
    ...
    self._model = InceptionV3(...)
    if torch.cuda.is_available():
        self._model.to('cuda')

When I run this code on an HPC, I get the following error:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

Is there something wrong with how I’m moving the model to a GPU? This code should work even if I have an arbitrary number of GPUs, right?

Update: this error only happens when I request a node with more than 1 GPU.

I guess your multi-GPU SLURM setup isn’t working correctly and might be masking all available devices, so that PyTorch isn’t able to use any of them.
If you are trying to use e.g. 2 GPUs, you could run this small test:

import torch

for d in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(d))

If this doesn’t print anything, then your container might not be able to “see” any devices.
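
You could also check what SLURM actually exposes to the process; a minimal sketch, assuming SLURM sets the usual environment variables (exact names can differ between installations):

import os
import torch

# SLURM typically exports these when GPUs are granted via --gres;
# if CUDA_VISIBLE_DEVICES is empty or missing, the process cannot see any GPU
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("SLURM_JOB_GPUS:", os.environ.get("SLURM_JOB_GPUS"))
print("visible device count:", torch.cuda.device_count())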

>>> for d in range(torch.cuda.device_count()):
...     print(torch.cuda.get_device_name(d))
...
GeForce RTX 2080 Ti
GeForce RTX 2080 Ti

To reiterate, this only happens when I have more than one GPU.

I don’t know what causes this issue: the Python shell can apparently access both GPUs, while you are still getting the CUDA error on the SLURM cluster, so I guess this error is raised by your custom script?

Well, this happens whether I use an sbatch script or an interactive session with srun; both times I request 1 or 2 GPUs with --gres=gpu:n. There isn’t really room for something to go wrong in the script, so I find it hard to imagine that could be the problem. It’s possible it’s an issue with pytorch-lightning, but the problem only happens with my pre-trained Inception model, which is just an nn.Module.
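
For reference, the sbatch script boils down to something like this (the --gres flag is the same one used in the failing runs; the time limit and script name are placeholders):

#!/bin/bash
#SBATCH --gres=gpu:2
#SBATCH --time=01:00:00

# placeholder name for the actual evaluation script
python compute_gan_metrics.py

The interactive case is essentially srun --gres=gpu:2 --pty bash, followed by running the same thing by hand.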

Are both workflows failing (sbatch and srun)? If so, how did you get the valid GPU output?

Could you remove Lightning, if you think this might be the issue, and run a “pure” PyTorch script?

That was from a python shell in an interactive srun session.

Thanks for the update.
Unfortunately, I don’t understand when this error is triggered, since you were able to use both GPUs in the interactive srun session, but are nevertheless hitting the error when running through sbatch or srun.

If that’s the case, I would assume that your actual script is not working correctly or that, as you suggested, Lightning or another 3rd-party library is causing problems.
Were you able to run a pure PyTorch script?
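
Something as minimal as this would already be enough (a sketch; the module and shapes are arbitrary):

import torch
import torch.nn as nn

# minimal pure-PyTorch check: tiny module, default CUDA device, one forward pass
model = nn.Linear(10, 10)
if torch.cuda.is_available():
    model.to('cuda')
    x = torch.randn(8, 10, device='cuda')
    out = model(x)
    print(out.device, out.shape)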

I’ll try to set aside some time today to put together a vanilla PyTorch script. Thanks for helping me with this!

I haven’t been able to make a pure PyTorch script that reproduces the error, since that would be non-trivial: the code involved extends a pytorch-lightning class (which in turn extends nn.Module). However, someone in another forum has suggested that the problem is that PyTorch does not support self.to('cuda') in multi-GPU scenarios, and that you have to specify 'cuda:n', where n is the device index. Do you know if this is true?

to('cuda') calls are supported in single-GPU and multi-GPU runs and will push the tensor or module to the current default device.
If your distributed setup masks the devices so that each process sees only a single GPU, to('cuda') will use that visible device, while to('cuda:n') with n > 0 would fail.
On the other hand, if all devices are visible to every process, you might end up creating multiple CUDA contexts, which also seems to be the issue in this topic.
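
A quick way to see the difference (assuming two visible devices; the module is arbitrary):

import torch
import torch.nn as nn

model = nn.Linear(10, 10)

# 'cuda' resolves to the current default device (index 0 unless changed)
print(torch.cuda.current_device())
model.to('cuda')
print(next(model.parameters()).device)  # cuda:0

# an explicit index targets that device and fails if it isn't visible
model.to('cuda:1')
print(next(model.parameters()).device)  # cuda:1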

I think I resolved it. I registered the Inception network as a submodule of my main model, so that both end up on the same device.
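
For anyone finding this later, the structure of the fix was roughly this (a minimal sketch; the class and attribute names are made up, and nn.Identity() stands in for the pre-trained InceptionV3):

import torch
import torch.nn as nn

class MainModel(nn.Module):
    def __init__(self, metric_net: nn.Module):
        super().__init__()
        self.generator = nn.Linear(10, 10)  # stand-in for the actual GAN parts
        # assigning the metric network as an attribute registers it as a
        # submodule, so a single .to(device) moves everything together
        self.metric_net = metric_net

model = MainModel(metric_net=nn.Identity())  # the real InceptionV3 instance goes here
if torch.cuda.is_available():
    model.to('cuda')
print(next(model.parameters()).device)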