RTX 5090 fails to initialize in PyTorch

Thank you for the reply.

I tried setting CUDA_VISIBLE_DEVICES to both 0 and 0,1,3,4, and I still get:

 CUDA_VISIBLE_DEVICES=0; python -c 'import torch; torch.cuda.is_available()'
/home/ansible/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
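
A side note on the command itself, offered as a possibility rather than a diagnosis: with a semicolon after the assignment, `CUDA_VISIBLE_DEVICES=0` becomes a plain shell variable and is not passed to the Python process unless it was exported earlier in the same shell. The `-c` snippet also never prints the return value of `torch.cuda.is_available()`, so only the warning shows up. The sketch below uses a stand-in child process instead of torch to show the difference between the three spellings. The helper name `child_sees` is hypothetical, and the sketch assumes a POSIX `sh` is available.

```python
import os
import shlex
import subprocess
import sys

# Stand-in for the torch one-liner: the child only reports what it sees.
CHILD = 'import os; print(os.environ.get("CUDA_VISIBLE_DEVICES"))'
CMD = f"{shlex.quote(sys.executable)} -c {shlex.quote(CHILD)}"


def child_sees(shell_prefix: str) -> str:
    """Run `<shell_prefix> python -c ...` under sh and return what the child printed."""
    # Drop any inherited value so the result depends only on shell_prefix.
    env = {k: v for k, v in os.environ.items() if k != "CUDA_VISIBLE_DEVICES"}
    result = subprocess.run(
        ["sh", "-c", f"{shell_prefix} {CMD}"],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(child_sees("CUDA_VISIBLE_DEVICES=0;"))         # None (not exported)
    print(child_sees("CUDA_VISIBLE_DEVICES=0"))          # 0 (per-command prefix)
    print(child_sees("export CUDA_VISIBLE_DEVICES=0;"))  # 0 (exported)
```

If that applies here, the equivalent torch check would be `CUDA_VISIBLE_DEVICES=0 python -c 'import torch; print(torch.cuda.is_available())'`. This only affects which devices are visible; it would not by itself explain the driver initialization failure.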

Since it works for you, I probably messed up some of the cabling.
I hope it is okay to ask a few questions about your setup :)

  • What kind of riser cards did you use? We used the MCIO PCIe gen5 Device Adapter x8/x16 (c-payne.com), with the lowest-numbered MCIO port going into the MCIO connector closest to the power connector on the riser card.

  • Did you change any BIOS settings or flash the BIOS? I changed three settings in the BIOS:

    1. Disabled IOMMU: advanced → AMD CBS → NBIO common options → IOMMU = disabled
    2. Set the link width to 16: advanced → AMD CBS → NBIO common options → SMU common options → xGMI
    3. Changed the MCIO setting from 4x/4x/4x/4x to 16x: advanced → chipset configuration → pcie link width → MCIO
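
To check whether the cabling and the MCIO link-width setting actually produce an x16 link, the negotiated width can be read from sysfs on Linux. The following is a minimal sketch, not a verified diagnostic: the function name `nvidia_link_report` is hypothetical, and it assumes the standard sysfs attributes `vendor`, `class`, `current_link_width`, `max_link_width`, and `current_link_speed` are present for each PCI device.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID for NVIDIA


def read_attr(dev: Path, name: str) -> str:
    """Read one sysfs attribute, returning 'unknown' if it is missing."""
    try:
        return (dev / name).read_text().strip()
    except OSError:
        return "unknown"


def nvidia_link_report(root: Path = Path("/sys/bus/pci/devices")) -> list[dict]:
    """List NVIDIA display devices with their negotiated PCIe link."""
    report = []
    if not root.is_dir():
        return report
    for dev in sorted(root.iterdir()):
        if read_attr(dev, "vendor") != NVIDIA_VENDOR_ID:
            continue
        # Class 0x03xxxx is a display controller; this skips the GPU's audio function.
        if not read_attr(dev, "class").startswith("0x03"):
            continue
        report.append({
            "address": dev.name,
            "width": read_attr(dev, "current_link_width"),
            "max_width": read_attr(dev, "max_link_width"),
            "speed": read_attr(dev, "current_link_speed"),
        })
    return report


if __name__ == "__main__":
    for row in nvidia_link_report():
        print(f"{row['address']}: x{row['width']} (max x{row['max_width']}), {row['speed']}")
```

A card that does not appear at all, or that reports a width below its maximum, would point toward the riser, the cable order, or the bifurcation setting. The reported speed may be lower than the maximum while the GPU is idle, so the width is probably the more useful number here.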