Hi, I am running an inference script on a server with seven NVIDIA GeForce GTX 1080 Ti GPUs. It currently runs on a single GPU, which works fine. However, I’d like to parallelize it across several GPUs. I am using Hugging Face Accelerate for that, so this might also be an issue with their tool. When I run my script with accelerate launch --config_file [my_config_file] run.py, torch can no longer find CUDA. I am in the same conda env that works fine for single-GPU inference. The config file sets CUDA_VISIBLE_DEVICES, in my case to ‘[1,2,3,4]’ (GPU ‘0’ is busy running the single-GPU script mentioned above). The same happens when I explicitly set CUDA_VISIBLE_DEVICES=1,2,3,4 before calling accelerate [...].
I can’t figure out why CUDA is not available in this case. Some details about the setup:
- CUDA driver version 11.8
- nvcc --version prints:
  nvcc: NVIDIA (R) Cuda compiler driver
  Copyright (c) 2005-2017 NVIDIA Corporation
  Built on Fri_Nov__3_21:07:56_CDT_2017
  Cuda compilation tools, release 9.1, V9.1.85
- torch.version.cuda is 11.8
- PyTorch version 2.0.0
- dmesg shows no NVRM entries that hint at problems
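
To narrow this down, here is a minimal diagnostic sketch that could be launched through accelerate in place of run.py, to see what each spawned process actually sees. It is not part of my real script. The helper names parse_visible_devices and report are made up for illustration, and the check assumes that CUDA_VISIBLE_DEVICES should be a plain comma-separated list of integer indices or GPU-/MIG- UUIDs, which is my reading of the CUDA documentation. Under that assumption a bracketed value like ‘[1,2,3,4]’ would be flagged as malformed.

```python
import os


def parse_visible_devices(value):
    """Split a CUDA_VISIBLE_DEVICES value into device tokens.

    Hypothetical helper for illustration. Assumes each token must be an
    integer index or a UUID starting with "GPU-" or "MIG-".
    Returns None if the variable is unset and [] if it is empty.
    Raises ValueError on any other token (for example "[1" or "4]").
    """
    if value is None:
        return None  # unset: no restriction on visible GPUs
    if value.strip() == "":
        return []  # empty string: no GPUs visible
    tokens = [token.strip() for token in value.split(",")]
    for token in tokens:
        if not token.isdigit() and not token.startswith(("GPU-", "MIG-")):
            raise ValueError(f"unexpected device token: {token!r}")
    return tokens


def report():
    """Print what the current process sees; torch is imported only if present."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES")
    print(f"CUDA_VISIBLE_DEVICES={raw!r}")
    try:
        print("parsed tokens:", parse_visible_devices(raw))
    except ValueError as exc:
        print("malformed value:", exc)

    try:
        import torch
    except ImportError:
        print("torch is not importable in this environment")
        return
    print("torch.version.cuda:", torch.version.cuda)
    print("torch.cuda.is_available():", torch.cuda.is_available())
    print("torch.cuda.device_count():", torch.cuda.device_count())


if __name__ == "__main__":
    report()
```

Running the same file once with plain python and once via accelerate launch should show whether the environment variable or the CUDA runtime differs between the two cases.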