SLURM: RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable

I am trying to train EG3D on a SLURM cluster using multiple GPUs, but I am getting the following error:

File "/home/dmpribak/ondemand/data/sys/myjobs/projects/default/4/train.py", line 395, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "/home/dmpribak/ondemand/data/sys/myjobs/projects/default/4/train.py", line 105, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "/home/dmpribak/.conda/envs/eg3d3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmpribak/.conda/envs/eg3d3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/home/dmpribak/.conda/envs/eg3d3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/dmpribak/.conda/envs/eg3d3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/home/dmpribak/ondemand/data/sys/myjobs/projects/default/4/train.py", line 54, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "/home/dmpribak/ondemand/data/sys/myjobs/projects/default/4/training/training_loop.py", line 196, in training_loop
    torch.distributed.broadcast(param, src=0)
  File "/home/dmpribak/.conda/envs/eg3d3/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmpribak/.conda/envs/eg3d3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
    work = default_pg.broadcast([tensor], opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I have the latest version of PyTorch installed with CUDA 12.1. `torch.cuda.is_available()` returns True, and both GPUs are listed by:

for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

Here is my nvidia-smi output:

Please let me know if any further information would help pin down the cause of this.

I’d also like to add that training runs completely fine when I only use a single GPU.

Try disabling the “Exclusive Process” compute mode your GPUs are currently set to and rerun the script.

I have tried this, but I am running on a university cluster and don’t have permission to do so.

In that case you might need to ask an admin to change it, to check whether exclusive mode is blocking the multi-process workload.
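In the meantime, reading the current mode doesn’t need admin rights, so you can confirm the setting before contacting them. A minimal sketch, assuming `nvidia-smi` is on the PATH (the parsing helper is my own, not part of any library):

```python
import subprocess

def parse_compute_modes(csv_text):
    # One line per GPU, e.g. "Default" or "Exclusive_Process"
    return [line.strip() for line in csv_text.splitlines() if line.strip()]

try:
    # --query-gpu=compute_mode is a read-only query; no admin rights needed
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    for idx, mode in enumerate(parse_compute_modes(out)):
        print(f"GPU {idx}: {mode}")
except (FileNotFoundError, subprocess.CalledProcessError):
    print("nvidia-smi not available on this node")
```

If any GPU reports `Exclusive_Process`, only one process can hold a CUDA context on it, which is exactly what a multi-process DDP launch violates.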

I can ask. Is there anything else that might cause this issue that I could try in the meantime?


I also have this problem. In my case I am working with two A100s (driver version 535.129.03, CUDA version 12.1) and PyTorch 2.1.0.dev20230621+cu117.

# simplest sample from Pytorch Get Started
Traceback (most recent call last):
  File "test.py", line 44, in <module>
    x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Before using the PyTorch dev version I got this error:

RuntimeError: CUDA error: operation not supported
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I had the same error message as dmpribak’s when I ran a Distributed Data Parallel sample program.
Disabling exclusive-process mode with `nvidia-smi -c 0` solved the problem!
Thanks.

I’m still facing this issue. All four GPUs I’m running DDP on are in default mode, not exclusive mode.
My code throws this error after 1 day of execution, and I’m currently resuming from existing checkpoints.

The command nvidia-smi also seems to hang for me once this error occurs.

Check for any Xids in dmesg, as the error could be related to your setup (e.g. an overheating machine) and might be unrelated to PyTorch.
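Xid events show up in the kernel log as `NVRM: Xid` lines; a quick sketch of scanning for them (the regex and helper are my own, assuming the usual driver log format):

```python
import re
import subprocess

# Driver errors typically look like:
# "NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus."
XID_RE = re.compile(r"NVRM: Xid \([^)]*\):\s*(\d+)")

def find_xids(kernel_log):
    # Return the numeric Xid codes found in kernel log text
    return [int(m.group(1)) for m in XID_RE.finditer(kernel_log)]

try:
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    print("Xid codes seen:", find_xids(log))
except FileNotFoundError:
    print("dmesg not available")
```

Any nonzero hits can then be looked up in NVIDIA’s Xid error table to see whether it points at hardware, thermals, or the driver.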

Hmm, I just checked but there’s nothing in dmesg.
Are there any other logs or places you could suggest I check?

[Edit]: When I try to kill the job, it hangs and then I get these log messages

Message from syslogd@localhost at Sep 13 20:02:47 ...
 kernel:[82362.359551] watchdog: BUG: soft lockup - CPU#44 stuck for 74s! [nvidia-smi:1881056]

For context, I’m also running multiprocessing operations on each batch before passing the output to the GPU.
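A minimal sketch of that per-batch preprocessing pattern, using the ‘spawn’ start method, since forked workers in a process that has already initialized CUDA can themselves trigger this “busy or unavailable” error (the functions here are simplified stand-ins for my actual code):

```python
import multiprocessing as mp

def preprocess(batch_item):
    # Stand-in for the real per-batch CPU-side work
    return batch_item * 2

def preprocess_batch(items):
    # CUDA contexts do not survive fork(); once the parent process has
    # touched the GPU, fork-started workers can fail with CUDA errors.
    # 'spawn' starts each worker in a fresh interpreter instead.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(preprocess, items)
```

Note that with spawn the worker functions must be importable, i.e. defined at module top level, and any script entry point needs an `if __name__ == "__main__":` guard.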