I’m trying to build a PyTorch training machine with 4x RTX 5090 cards, but am running into some issues with getting PyTorch to find the cards.
First, the cards are detected fine with nvidia-smi:
Tue May 13 11:19:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.144 Driver Version: 570.144 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 Off | 00000000:01:00.0 Off | N/A |
| 0% 36C P0 81W / 600W | 0MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 Off | 00000000:21:00.0 Off | N/A |
| 0% 34C P0 60W / 600W | 0MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 5090 Off | 00000000:C1:00.0 Off | N/A |
| 0% 33C P0 65W / 600W | 0MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 5090 Off | 00000000:E1:00.0 Off | N/A |
| 0% 31C P0 65W / 600W | 0MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
But CUDA is not available in torch, and it fails with:
>>> import torch
>>> torch.cuda.is_available()
/home/ansible/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
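To separate a PyTorch problem from a driver problem, one quick check is to call the CUDA driver API directly through ctypes, bypassing torch entirely. This is a minimal sketch of that idea (assuming Linux and a standard driver install, where libcuda.so.1 is the driver API library); a non-zero return from cuInit means the driver itself fails to initialize:

import ctypes

# Load the CUDA driver API library shipped with the NVIDIA driver.
libcuda = ctypes.CDLL("libcuda.so.1")

# cuInit(0) returns 0 (CUDA_SUCCESS) when the driver initializes cleanly.
rc = libcuda.cuInit(0)
print("cuInit returned", rc)

if rc == 0:
    count = ctypes.c_int()
    libcuda.cuDeviceGetCount(ctypes.byref(count))
    print("driver sees", count.value, "device(s)")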
In the same environment, setting visible devices to only one GPU works fine (e.g. export CUDA_VISIBLE_DEVICES=0; python train.py …). It seems to error out in DDP: I can get past the original error by specifying up to 4 GPUs in CUDA_VISIBLE_DEVICES, but then I get a “CUDA error: an illegal memory access was encountered” with 2 or more GPUs under DDP.
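A bare-bones DDP check can take the training code out of the equation. Below is a minimal sketch (the file name ddp_smoke.py and the two-GPU launch are my own choices; run it with torchrun --nproc_per_node=2 ddp_smoke.py). A single all-reduce is enough to exercise GPU-to-GPU traffic, and if even this hits the illegal memory access, rerunning with NCCL_P2P_DISABLE=1 set is a useful diagnostic for whether PCIe peer-to-peer transfers are the problem:

import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK and LOCAL_RANK for each spawned process.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # A single all-reduce (sum) forces communication between all ranks.
    t = torch.ones(1, device="cuda") * rank
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce result = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()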
Smoke test (another data point during debugging):
export CUDA_VISIBLE_DEVICES=0,1,2,3; python -c 'import torch; torch.cuda.is_available()'
works fine, but adding a fifth (nonexistent) device index makes it fail:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4; python -c 'import torch; torch.cuda.is_available()'
/home/adaboost/miniconda3/envs/mustango/lib/python3.10/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
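One caveat with this kind of testing: the driver reads CUDA_VISIBLE_DEVICES once, at initialization, so each device-list combination has to run in a fresh interpreter. A small sweep script (a sketch; the device lists just match my setup) automates the bisection:

import os
import subprocess

# Try progressively larger device lists, each in its own Python process.
for devs in ["0", "0,1", "0,1,2", "0,1,2,3", "0,1,2,3,4"]:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=devs)
    proc = subprocess.run(
        ["python", "-c",
         "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"],
        env=env, capture_output=True, text=True,
    )
    print(devs, "->", proc.stdout.strip() or proc.stderr.strip())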
I tried setting CUDA_VISIBLE_DEVICES to both 0 and 0,1,3,4, and I still get:
export CUDA_VISIBLE_DEVICES=0; python -c 'import torch; torch.cuda.is_available()'
/home/ansible/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
Since that is working for you, I probably messed up some of the cabling.
I hope I can ask a few questions about your setup. :)
What kind of riser cards did you use? We used the MCIO PCIe gen5 Device Adapter x8/x16 from c-payne.com, with the lowest-numbered MCIO port going into the MCIO connector closest to the power connector on the riser card.
Did you change any BIOS settings or flash the BIOS? I changed 3 settings in the BIOS (see the link-width check after this list):
disable IOMMU: Advanced → AMD CBS → NBIO Common Options → IOMMU → Disabled
set the link width to 16: Advanced → AMD CBS → NBIO Common Options → SMU Common Options → xGMI
change the MCIO links from x4/x4/x4/x4 to x16: Advanced → Chipset Configuration → PCIe Link Width → MCIO
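As a sanity check on what the riser and BIOS combination actually negotiated, the current PCIe generation and width can be read per GPU at runtime; this sketch just wraps nvidia-smi's documented --query-gpu fields:

import subprocess

# Report the PCIe link each GPU negotiated (generation and lane width).
out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)

If a card reports x4 where you expect x16, the MCIO bifurcation setting or the cabling is the first suspect.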
As an update: this is not a PyTorch issue, but something with the drivers. I compiled the CUDA samples and ran one of the multi-GPU tests (cuda-samples/build/Samples/0_Introduction/simpleMultiGPU) with CUDA_VISIBLE_DEVICES=0,1,2,3 and 0,1,2,3,4, which reproduced the init error:
Starting simpleMultiGPU
CUDA error at cuda-samples/Samples/0_Introduction/simpleMultiGPU/simpleMultiGPU.cu:100 code=3(cudaErrorInitializationError) "cudaGetDeviceCount(&GPU_N)"
At this point I will start with a fresh OS/driver/PyTorch install and see if that clears it; otherwise I will just have to wait for an updated driver release.
But as you can see, no one has been able to give any relevant help.
Your error is:
/home/ansible/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
My error is:
Did you run some cuda functions before calling NumCudaDevices() that might have already set an error?
Error 304: OS call failed or operation not supported on this OS (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
They are exactly the same!
The only way to make Stable Diffusion work is to use the NVIDIA driver and linux-nvidia-libs version 525.78.01 (the latest version that works is 535) together with "torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113", and it works like a charm:
(pytorch) I have no name!@marietto:/usr/home/marietto$ LD_PRELOAD="/compat/dummy-uvm.so" python3 -c 'import torch; print(torch.cuda.is_available())'
True
(pytorch) I have no name!@marietto:/usr/home/marietto$ LD_PRELOAD="/compat/dummy-uvm.so" python3 -c 'import torch; print(torch.cuda.get_device_name(0))'
NVIDIA GeForce RTX 2080 Ti
Final update: fixed, but I had to revert to Linux kernel 6.5 with Ubuntu 22.04. Not sure what happened, but at least my jobs are running now. This was based on a hint from another user running a B200 and hitting similar issues: