I have three GPUs and have been trying to set CUDA_VISIBLE_DEVICES in my environment, but I am confused by the difference in the ordering of the GPUs between nvidia-smi and torch.cuda.get_device_name. Here is the output of both:
I would have expected the device numbers to be consistent across these applications. If not, what should I expect when using another library such as Keras or TensorFlow?
The device numbering is consistent across all applications, except for nvidia-smi, which ignores the CUDA_DEVICE_ORDER environment variable.
The problem is that by default the CUDA device ordering is FASTEST_FIRST, while nvidia-smi uses PCI_BUS_ID.
To make your applications consistent with nvidia-smi, just add export CUDA_DEVICE_ORDER=PCI_BUS_ID to your .bashrc (or equivalent) so that every application uses nvidia-smi's ordering.
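As an alternative to .bashrc, here is a minimal sketch of setting the ordering from inside a Python script; the only real constraint is that the variable must be set before the first `import torch`, since PyTorch reads it when it initializes the CUDA runtime:

```python
import os

# Must be set before PyTorch initializes CUDA, i.e. before the
# first `import torch` in this process.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# import torch
# torch.cuda.get_device_name(0)  # device 0 now matches GPU 0 in nvidia-smi
```

Note that setting it after torch has already been imported has no effect, which is why exporting it in the shell is the safer habit.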
@albanD you are awesome! That’s exactly what I needed. Can’t tell you how much time I spent on this. I’m sure I missed it somewhere in the CUDA and/or PyTorch documentation. Thanks again.
You are supposed to set the env variable CUDA_DEVICE_ORDER to PCI_BUS_ID as explained in the CUDA docs. echo $CUDA_DEVICE_ORDER might not return anything if you haven’t exported it already.
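A quick way to check from inside Python whether the variable actually reached the process (a small sketch; the `"<not set>"` fallback string is just a placeholder default):

```python
import os

# A shell assignment without `export` is invisible to child processes,
# so this prints "<not set>" unless the variable was actually exported.
print(os.environ.get("CUDA_DEVICE_ORDER", "<not set>"))
```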
Hello ptrblck, I am trying to use PyTorch with the MIGs on my A100 80GB. However, CUDA only detects the physical GPUs based on their ID, not the MIG instances via their UUID. Is there any way for PyTorch to detect the MIGs? I have been working on this for a long time and have yet to find a solution to my problem. Thank you.
Sorry, but I don’t understand which part of CUDA you are referring to, what exactly this means, or what kind of issue you are seeing. Could you describe your issue in more detail, please?
That’s expected as “Multi-MIG” is not supported and you would need to use CUDA_VISIBLE_DEVICES=MIG-slice in any case as described in the MIG user guide.
That’s not the case, as given in the linked user guide. PyTorch itself also supports it as seen in this comment. If you are seeing any issues with this, could you try to use PYTORCH_NVML_BASED_CUDA_CHECK=0 and rerun your use case using the MIG slice as the visible device?
I can confirm that the issue was related to the PyTorch version. After upgrading to a more recent PyTorch build with CUDA 11.8 using pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu118, I am now able to use a MIG UUID directly with CUDA_VISIBLE_DEVICES.
Running PyTorch with CUDA_VISIBLE_DEVICES=MIG-<UUID> correctly reports a single visible device, and torch.cuda.device_count() returns 1, with the device name corresponding to the selected MIG instance.
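For reference, a minimal sketch of the working setup (the MIG UUID below is a placeholder; list the real ones on your machine with `nvidia-smi -L`):

```python
import os

# Placeholder MIG UUID; replace with one reported by `nvidia-smi -L`.
# Must be set before the first `import torch`.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# import torch
# torch.cuda.device_count()      # 1: only the selected MIG slice is visible
# torch.cuda.get_device_name(0)  # name of the selected MIG instance
```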
Everything now behaves as expected. If I run into any further issues related to this, I will reach out again.