Thank you for the reply.
I tried setting CUDA_VISIBLE_DEVICES to both 0 and 0,1,3,4, and I still get the same warning:
```
CUDA_VISIBLE_DEVICES=0; python -c 'import torch; torch.cuda.is_available()'
/home/ansible/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
```
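
One caveat about the command above: with the semicolon, `CUDA_VISIBLE_DEVICES=0;` only sets a shell variable, so Python sees it only if the variable was exported earlier in the session, and the one-liner never prints the result of `is_available()`. A rough sketch of a check that reports what the Python process actually sees could look like this. The function name `cuda_visibility_report` is made up for illustration, and the `/dev/nvidia*` lookup assumes Linux with the NVIDIA driver's usual device nodes.

```python
import glob
import os
import shutil


def cuda_visibility_report():
    """Collect basic facts about what this Python process can see (hypothetical helper)."""
    report = {
        # Value as seen by *this* process; None means it was never exported.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # Assumption: Linux, where the NVIDIA driver creates /dev/nvidia* nodes.
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        # Full path to nvidia-smi if it is on PATH, else None.
        "nvidia_smi": shutil.which("nvidia-smi"),
    }
    try:
        import torch
    except ImportError:
        # torch is not installed in this interpreter; leave the fields empty.
        report["torch_cuda_available"] = None
        report["torch_device_count"] = None
    else:
        report["torch_cuda_available"] = torch.cuda.is_available()
        report["torch_device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_visibility_report().items():
        print(f"{key}: {value}")
```

Running it as `CUDA_VISIBLE_DEVICES=0 python script.py` (no semicolon) passes the variable to that one process. If `device_nodes` comes back empty, the problem is probably below PyTorch, at the driver or hardware level.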
Since that works for you, I probably messed up some of the cabling.
I hope it is OK if I ask a few questions about your setup :)
- What kind of riser cards did you use? We used the MCIO PCIe gen5 Device Adapter x8/x16 from c-payne.com, with the lowest-numbered MCIO port going into the MCIO connector closest to the power connector on the riser card.
- Did you change any BIOS settings or flash the BIOS? I changed three settings in the BIOS:
  - disabled IOMMU under advanced → AMD CBS → NBIO common options → IOMMU (set to disabled)
  - set the link width to 16 under advanced → AMD CBS → NBIO common options → SMU common options → xGMI
  - changed the MCIO from 4x/4x/4x/4x to 16x under advanced → chipset configuration → PCIe link width → MCIO
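
To compare cabling and link-width settings between two machines, it may help to look at the PCIe link width each GPU actually negotiated. Below is a minimal sketch, assuming Linux and the standard sysfs attributes `vendor`, `class`, `current_link_width` and `max_link_width` under `/sys/bus/pci/devices`. The helper names `read_attr` and `nvidia_link_status` are invented for this example.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def read_attr(device_dir, name):
    """Return a sysfs attribute as a stripped string, or None if unreadable."""
    try:
        return (device_dir / name).read_text().strip()
    except OSError:
        return None


def nvidia_link_status(sysfs_root="/sys/bus/pci/devices"):
    """List negotiated vs. maximum PCIe link width for NVIDIA display devices."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    rows = []
    for dev in sorted(root.iterdir()):
        if read_attr(dev, "vendor") != NVIDIA_VENDOR_ID:
            continue
        # PCI class 0x03xxxx is "display controller"; this skips the GPU's audio function.
        pci_class = read_attr(dev, "class") or ""
        if not pci_class.startswith("0x03"):
            continue
        rows.append({
            "address": dev.name,
            "current_width": read_attr(dev, "current_link_width"),
            "max_width": read_attr(dev, "max_link_width"),
        })
    return rows


if __name__ == "__main__":
    for row in nvidia_link_status():
        print(f"{row['address']}: x{row['current_width']} (max x{row['max_width']})")
```

If a GPU does not show up here at all, it was not enumerated on the PCIe bus, which would point at the riser, cable, or BIOS bifurcation setting rather than at CUDA or PyTorch. A card listed with a lower current width than its maximum could hint at a cable plugged into the wrong MCIO port.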