Gpu devices: nvidia-smi and cuda.get_device_name() output appear inconsistent

prairie-guy · February 1, 2018, 4:02pm

I have three GPU’s and have been trying to set CUDA_VISIBLE_DEVICES in my environment, but am confused by the difference in the ordering of the gpu’s in nvidia-smi and torch.cuda.get_device_name Here is the output of both:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.34                 Driver Version: 387.34                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0  On |                  N/A |
| 23%   38C    P8    17W / 250W |    811MiB / 12188MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Graphics Device     Off  | 00000000:03:00.0 Off |                  N/A |
| 34%   49C    P8    26W / 250W |      0MiB / 12058MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Graphics Device     Off  | 00000000:04:00.0 Off |                  N/A |
| 28%   40C    P8    24W / 250W |      0MiB / 12058MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

>>> torch.cuda.get_device_name(0)
'Graphics Device'
>>> torch.cuda.get_device_name(1)
'TITAN Xp'
>>> torch.cuda.get_device_name(2)
'Graphics Device'

I would have expected the device numbers to be consistent across these applications. If not, then what should I expect if I am using another library like keras or tensor-flow?

Thanks in advance.

albanD · February 1, 2018, 4:08pm

Hi,

The device numbering is consistent accros all applications, except nvidia-smi that ignores the CUDA_DEVICE_ORDER environment variable.
The problem is that by default the device ordering is FASTEST_FIRST while nvidia-smi uses PCI_BUS_ID.
To make your applications consistent with nvidia_smi, just add export CUDA_DEVICE_ORDER=PCI_BUS_ID to your bashrc (or equivalent) such that every application uses nvidia-smi's ordering.

prairie-guy · February 1, 2018, 4:26pm

@albanD you are awesome! That’s exactly what I needed. Can’t tell you how much time I spent on this. I’m sure I missed it, in cuda and/or Pytorch related documentation. Thanks again.

Muhammad_Sohail · April 30, 2019, 7:59am

thanks, it helped me.

raining_day513 · August 23, 2022, 7:03am

either echoing $PCI_BUS_ID or $CUDA_DEVICE_ORDER I got the same result: nothing? So I assume this won’t work in my case.

ptrblck · August 23, 2022, 7:31am

You are supposed to set the env variable CUDA_DEVICE_ORDER to PCI_BUS_ID as explained in the CUDA docs.
echo $CUDA_DEVICE_ORDER might not return anything if you haven’t exported it already.

Tendeiro2 · December 18, 2025, 12:30pm

Hello ptrblck, I am trying to use Torch with the MIGs on my A100 80GB. However, CUDA only detects physical GPUs based on their ID, not their UUID. Is there any way for Torch to detect the MIGs? I have been working on this for a long time and have yet to find a solution to my problem. Thank you.

ptrblck · December 18, 2025, 12:31pm

Sorry, but I don’t understand which part of CUDA you are referring to and what exactly this means or what kind of issue you are seeing. Could you describe your issue in more details, please?

Tendeiro2 · December 18, 2025, 12:42pm

Thank you for your reply — let me clarify my configuration with specific details.

I am using an NVIDIA A100 80 GB PCIe with MIG enabled, as well as an RTX A4000 on the same system.

Environment:

- NVIDIA driver: 550.144.03

- CUDA: 12.4

According to `nvidia-smi`, the A100 (GPU 0) is divided into three MIG instances, each with ~20 GB of memory.

However, within PyTorch:

- Devices are only exposed as numeric CUDA indices.

- `torch.cuda.device_count()` returns the number of physical GPUs, which in this case is 2, and does not detect MIGs as an option.

- `CUDA_VISIBLE_DEVICES` only accepts indices, not MIG UUIDs.

My problem is that I cannot reliably bind a PyTorch process to a specific MIG instance by UUID, because cuda.device_count device returns 0.

ptrblck · December 18, 2025, 9:13pm

That’s expected as “Multi-MIG” is not supported and you would need to use CUDA_VISIBLE_DEVICES=MIG-slice in any case as described in the MIG user guide.

That’s not the case, as given in the linked user guide. PyTorch itself also supports it as seen in this comment. If you are seeing any issues with this, could you try to use PYTORCH_NVML_BASED_CUDA_CHECK=0 and rerun your use case using the MIG slice as the visible device?

Tendeiro2 · December 29, 2025, 2:59pm

Thanks for the help.

I can confirm that the issue was related to the PyTorch version. After upgrading to a more recent PyTorch build with CUDA 11.8 using pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu118, I am now able to use a MIG UUID directly with CUDA_VISIBLE_DEVICES.

Running PyTorch with CUDA_VISIBLE_DEVICES=MIG-<UUID> correctly reports a single visible device, and torch.cuda.device_count() returns 1, with the device name corresponding to the selected MIG instance.

Everything now behaves as expected. If I run into any further issues related to this, I will reach out again.

Thanks!