CUDA not available when running multi-GPU inference

Hi, I am running an inference script on a server with 7 NVIDIA GeForce GTX 1080 Ti GPUs. It currently runs on a single GPU, which works fine. However, I’d like to parallelize it (I am using Hugging Face accelerate for that, so this might also be an issue with their tool…). When I run my script with accelerate launch --config_file [my_config_file] run.py, torch can no longer find CUDA. I am in the same conda env that works fine for single-GPU inference. The config file sets CUDA_VISIBLE_DEVICES, in my case to ‘[1,2,3,4]’ (the single-GPU script mentioned above is running on GPU ‘0’). The same happens when I explicitly set CUDA_VISIBLE_DEVICES=1,2,3,4 before calling accelerate [...].
I can’t seem to figure out why CUDA is not available in this case.

  • CUDA driver version: 11.8
  • nvcc --version prints: nvcc: NVIDIA (R) Cuda compiler driver, Copyright (c) 2005-2017 NVIDIA Corporation, Built on Fri_Nov__3_21:07:56_CDT_2017, Cuda compilation tools, release 9.1, V9.1.85
  • torch.version.cuda is 11.8
  • PyTorch version: 2.0.0
  • dmesg has no NVRM entries hinting at problems
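One thing that can be ruled out quickly is the format of CUDA_VISIBLE_DEVICES itself. The CUDA runtime expects a comma-separated list of device indices or UUIDs without brackets, so a literal ‘[1,2,3,4]’ is unlikely to be read as intended. The following is only a minimal sketch of such a check; the helper name split_visible_devices is hypothetical and not part of torch or accelerate.

```python
import os


def split_visible_devices(value):
    """Split a CUDA_VISIBLE_DEVICES string into (accepted, suspicious) entries.

    Hypothetical helper for illustration only. It accepts plain integer
    indices and entries that look like GPU-/MIG- UUIDs; everything else
    (brackets, empty entries, stray text) is reported as suspicious.
    """
    accepted, suspicious = [], []
    for entry in value.split(","):
        entry = entry.strip()
        if entry.isdigit() or entry.startswith(("GPU-", "MIG-")):
            accepted.append(entry)
        else:
            suspicious.append(entry)
    return accepted, suspicious


if __name__ == "__main__":
    raw = os.environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        print("CUDA_VISIBLE_DEVICES is not set, so no GPUs are hidden by it.")
    else:
        ok, bad = split_visible_devices(raw)
        print(f"raw={raw!r} accepted={ok} suspicious={bad}")
```

Running this inside the process started by accelerate shows the value that process actually sees, which may differ from what was typed in the shell or the config file.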

Could you describe your issue in more detail and explain what “CUDA not available” means in this context, given that you can use your GPU?

After importing torch, torch.cuda.is_available() returns False. Then, during the initialization of accelerate.PartialState() (which, if I’m not mistaken, uses torch.nn.parallel.DistributedDataParallel under the hood), torch raises an AttributeError: ‘torch.cpu has no attribute device_count’. I guess this is because the CPU is torch’s fallback device when no GPUs are available, and accelerate then calls device_count on that device (I checked this while debugging).

I don’t know if accelerate uses another PyTorch binary, but you could double-check by printing torch.__version__ and torch.__path__ both in the standalone run (which is able to detect the GPU) and in the script launched by accelerate.
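A minimal sketch of such a comparison script is below. It is only an illustration: the function name environment_report and the choice of extra fields (interpreter path, CONDA_PREFIX, CUDA_VISIBLE_DEVICES) are assumptions added here, not something accelerate or PyTorch provides. It skips the torch fields if torch cannot be found in the current environment.

```python
import importlib.util
import os
import sys


def environment_report():
    """Collect details that identify which Python and PyTorch install is in use."""
    report = {
        "python_executable": sys.executable,
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_found": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_found"]:
        import torch

        report["torch_version"] = torch.__version__
        report["torch_path"] = list(torch.__path__)
        # CUDA version the binary was built against (None for CPU-only builds).
        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
        report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Saved under a hypothetical name such as report.py, it can be run once with plain python and once with accelerate launch --config_file [my_config_file] report.py. Any difference in python_executable, torch_path, or torch_cuda_build between the two outputs would suggest that the two runs are not using the same environment.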

Thanks @ptrblck. Turns out it was a corrupt conda config on my end. Seems to be resolved now.