SLURM: RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable

I am trying to train EG3D on a SLURM cluster using multiple GPUs, but I am getting the following error:

File "/home/dmpribak/ondemand/data/sys/myjobs/projects/default/4/train.py", line 395, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "/home/dmpribak/ondemand/data/sys/myjobs/projects/default/4/train.py", line 105, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "/home/dmpribak/.conda/envs/eg3d3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmpribak/.conda/envs/eg3d3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/home/dmpribak/.conda/envs/eg3d3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/dmpribak/.conda/envs/eg3d3/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/home/dmpribak/ondemand/data/sys/myjobs/projects/default/4/train.py", line 54, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "/home/dmpribak/ondemand/data/sys/myjobs/projects/default/4/training/training_loop.py", line 196, in training_loop
    torch.distributed.broadcast(param, src=0)
  File "/home/dmpribak/.conda/envs/eg3d3/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmpribak/.conda/envs/eg3d3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
    work = default_pg.broadcast([tensor], opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I have the latest version of PyTorch installed with CUDA 12.1. `torch.cuda.is_available()` returns True, and both GPUs are listed by:

for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

Here is my nvidia-smi output:

Please let me know if any further information would help pin down the cause of this.

I’d also like to add that training runs completely fine when I only use a single GPU.

Try disabling the “Exclusive Process” compute mode your GPUs are currently set to and rerun the script.

I have tried this, but I am running on a university cluster and don’t have permission to do so.

In that case you might need to ask an admin to change it, to check whether exclusive mode is blocking the multi-process workload.
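In the meantime, reading the current mode doesn’t need admin rights, so you can confirm the setting before contacting them. A minimal sketch, assuming `nvidia-smi` is on the PATH (the parsing helper is my own, not part of any library):

```python
import subprocess

def parse_compute_modes(csv_text):
    # One line per GPU, e.g. "Default" or "Exclusive_Process"
    return [line.strip() for line in csv_text.splitlines() if line.strip()]

try:
    # --query-gpu=compute_mode is a read-only query; no admin rights needed
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    for idx, mode in enumerate(parse_compute_modes(out)):
        print(f"GPU {idx}: {mode}")
except (FileNotFoundError, subprocess.CalledProcessError):
    print("nvidia-smi not available on this node")
```

If any GPU reports `Exclusive_Process`, only one process can hold a CUDA context on it, which is exactly what a multi-process DDP launch violates.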

I can ask. Is there anything else that might cause this issue that I could try in the meantime?


I also have this problem. In my case I am working with two A100s (driver version 535.129.03, CUDA version 12.1) and PyTorch 2.1.0.dev20230621+cu117.

# simplest sample from Pytorch Get Started
Traceback (most recent call last):
  File "test.py", line 44, in <module>
    x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Before using the PyTorch dev version I got this error:

RuntimeError: CUDA error: operation not supported
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I had the same error message as dmpribak’s when I ran a Distributed Data Parallel sample program.
Disabling exclusive-process mode with `nvidia-smi -c 0` solved the problem!
Thanks.

I’m still facing this issue. All four GPUs I’m running DDP on are in default mode, not exclusive mode.
My code throws this error after 1 day of execution, and I’m currently resuming from existing checkpoints.

The command nvidia-smi also seems to hang for me once this error occurs.

Check for any Xids in dmesg, as the error could be related to your setup (e.g. an overheating machine) and might be unrelated to PyTorch.
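Xid events show up in the kernel log as `NVRM: Xid` lines; a quick sketch of scanning for them (the regex and helper are my own, assuming the usual driver log format):

```python
import re
import subprocess

# Driver errors typically look like:
# "NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus."
XID_RE = re.compile(r"NVRM: Xid \([^)]*\):\s*(\d+)")

def find_xids(kernel_log):
    # Return the numeric Xid codes found in kernel log text
    return [int(m.group(1)) for m in XID_RE.finditer(kernel_log)]

try:
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    print("Xid codes seen:", find_xids(log))
except FileNotFoundError:
    print("dmesg not available")
```

Any nonzero hits can then be looked up in NVIDIA’s Xid error table to see whether it points at hardware, thermals, or the driver.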

Hmm, I just checked but there’s nothing in dmesg.
Are there any other logs or places you could suggest I check?

[Edit]: When I try to kill the job, it hangs and then I get these log messages

Message from syslogd@localhost at Sep 13 20:02:47 ...
 kernel:[82362.359551] watchdog: BUG: soft lockup - CPU#44 stuck for 74s! [nvidia-smi:1881056]

For context, I’m also running multiprocessing operations on each batch before passing the output to the GPU.
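A minimal sketch of that per-batch preprocessing pattern, using the ‘spawn’ start method, since forked workers in a process that has already initialized CUDA can themselves trigger this “busy or unavailable” error (the functions here are simplified stand-ins for my actual code):

```python
import multiprocessing as mp

def preprocess(batch_item):
    # Stand-in for the real per-batch CPU-side work
    return batch_item * 2

def preprocess_batch(items):
    # CUDA contexts do not survive fork(); once the parent process has
    # touched the GPU, fork-started workers can fail with CUDA errors.
    # 'spawn' starts each worker in a fresh interpreter instead.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(preprocess, items)
```

Note that with spawn the worker functions must be importable, i.e. defined at module top level, and any script entry point needs an `if __name__ == "__main__":` guard.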