I’m trying to build a PyTorch training machine with 4x RTX 5090 cards, but am running into some issues with getting PyTorch to find the cards.
First, the cards are detected fine with nvidia-smi:
Tue May 13 11:19:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 Off | 00000000:01:00.0 Off | N/A |
| 0% 36C P0 81W / 600W | 0MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 Off | 00000000:21:00.0 Off | N/A |
| 0% 34C P0 60W / 600W | 0MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 5090 Off | 00000000:C1:00.0 Off | N/A |
| 0% 33C P0 65W / 600W | 0MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 5090 Off | 00000000:E1:00.0 Off | N/A |
| 0% 31C P0 65W / 600W | 0MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
But CUDA is not available in torch, and it fails with:
>>> import torch
>>> torch.cuda.is_available()
/home/ansible/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
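To separate a PyTorch problem from a driver problem, one quick check is to call the CUDA driver API directly through ctypes, bypassing torch entirely. This is a minimal sketch of that idea (assuming Linux and a standard driver install, where libcuda.so.1 is the driver API library); a non-zero return from cuInit means the driver itself fails to initialize:

import ctypes

# Load the CUDA driver API library shipped with the NVIDIA driver.
libcuda = ctypes.CDLL("libcuda.so.1")

# cuInit(0) returns 0 (CUDA_SUCCESS) when the driver initializes cleanly.
rc = libcuda.cuInit(0)
print("cuInit returned", rc)

if rc == 0:
    count = ctypes.c_int()
    libcuda.cuDeviceGetCount(ctypes.byref(count))
    print("driver sees", count.value, "device(s)")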
In the same environment, setting visible devices to only one GPU works fine (e.g. export CUDA_VISIBLE_DEVICES=0; python train.py …). It seems to error out in DDP: I can get past the original error by specifying up to 4 GPUs in CUDA_VISIBLE_DEVICES, but then I get a “CUDA error: an illegal memory access was encountered” with 2 or more GPUs under DDP.
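A bare-bones DDP check can take the training code out of the equation. Below is a minimal sketch (the file name ddp_smoke.py and the two-GPU launch are my own choices; run it with torchrun --nproc_per_node=2 ddp_smoke.py). A single all-reduce is enough to exercise GPU-to-GPU traffic, and if even this hits the illegal memory access, rerunning with NCCL_P2P_DISABLE=1 set is a useful diagnostic for whether PCIe peer-to-peer transfers are the problem:

import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK and LOCAL_RANK for each spawned process.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # A single all-reduce (sum) forces communication between all ranks.
    t = torch.ones(1, device="cuda") * rank
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce result = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()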
Smoke test (another data point during debugging):
export CUDA_VISIBLE_DEVICES=0,1,2,3; python -c 'import torch; torch.cuda.is_available()'
works fine, but adding a fifth (nonexistent) device index makes it fail:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4; python -c 'import torch; torch.cuda.is_available()'
/home/adaboost/miniconda3/envs/mustango/lib/python3.10/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
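One caveat with this kind of testing: the driver reads CUDA_VISIBLE_DEVICES once, at initialization, so each device-list combination has to run in a fresh interpreter. A small sweep script (a sketch; the device lists just match my setup) automates the bisection:

import os
import subprocess

# Try progressively larger device lists, each in its own Python process.
for devs in ["0", "0,1", "0,1,2", "0,1,2,3", "0,1,2,3,4"]:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=devs)
    proc = subprocess.run(
        ["python", "-c",
         "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"],
        env=env, capture_output=True, text=True,
    )
    print(devs, "->", proc.stdout.strip() or proc.stderr.strip())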
I tried setting CUDA_VISIBLE_DEVICES to both 0 and 0,1,3,4, and I still get:
export CUDA_VISIBLE_DEVICES=0; python -c 'import torch; torch.cuda.is_available()'
/home/ansible/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
Since that is working for you, I probably messed up some of the cabling.
I hope I can ask a few questions about your setup. :)
What kind of riser cards did you use? We used the MCIO PCIe gen5 Device Adapter x8/x16 from c-payne.com, with the lowest-numbered MCIO port going into the MCIO connector closest to the power connector on the riser card.
Did you change any BIOS settings or flash the BIOS? I changed 3 settings in the BIOS (see the link-width check after this list):
disable IOMMU: Advanced → AMD CBS → NBIO Common Options → IOMMU → Disabled
set the link width to 16: Advanced → AMD CBS → NBIO Common Options → SMU Common Options → xGMI
change the MCIO links from x4/x4/x4/x4 to x16: Advanced → Chipset Configuration → PCIe Link Width → MCIO
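As a sanity check on what the riser and BIOS combination actually negotiated, the current PCIe generation and width can be read per GPU at runtime; this sketch just wraps nvidia-smi's documented --query-gpu fields:

import subprocess

# Report the PCIe link each GPU negotiated (generation and lane width).
out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)

If a card reports x4 where you expect x16, the MCIO bifurcation setting or the cabling is the first suspect.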
As an update: this is not a PyTorch issue, but something with the drivers. I compiled the CUDA samples and ran one of the multi-GPU tests (cuda-samples/build/Samples/0_Introduction/simpleMultiGPU) with CUDA_VISIBLE_DEVICES=0,1,2,3 and 0,1,2,3,4, which reproduced the init error:
Starting simpleMultiGPU
CUDA error at cuda-samples/Samples/0_Introduction/simpleMultiGPU/simpleMultiGPU.cu:100 code=3(cudaErrorInitializationError) "cudaGetDeviceCount(&GPU_N)"
At this point I will start with a fresh OS/driver/PyTorch install and see if that clears it; otherwise I will just have to wait for an updated driver release.
But as you can see, no one has been able to give any relevant help.
Your error is:
/home/ansible/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
My error is:
Did you run some cuda functions before calling NumCudaDevices() that might have already set an error?
Error 304: OS call failed or operation not supported on this OS (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
They are exactly the same!
The only way to make Stable Diffusion work is to use the NVIDIA driver and linux-nvidia-libs version 525.78.01 (the latest version that works is 535) together with "torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113", and it works like a charm:
(pytorch) I have no name!@marietto:/usr/home/marietto$ LD_PRELOAD="/compat/dummy-uvm.so" python3 -c 'import torch; print(torch.cuda.is_available())'
True
(pytorch) I have no name!@marietto:/usr/home/marietto$ LD_PRELOAD="/compat/dummy-uvm.so" python3 -c 'import torch; print(torch.cuda.get_device_name(0))'
NVIDIA GeForce RTX 2080 Ti
Final update: fixed, but I had to revert to Linux kernel 6.5 with Ubuntu 22.04. Not sure what happened, but at least my jobs are running now. This was based on a hint from another user running a B200 and hitting similar issues: