GPU devices: nvidia-smi and torch.cuda.get_device_name() output appear inconsistent

I have three GPUs and have been trying to set CUDA_VISIBLE_DEVICES in my environment, but I am confused by the difference in GPU ordering between nvidia-smi and torch.cuda.get_device_name(). Here is the output of both:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.34                 Driver Version: 387.34                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0  On |                  N/A |
| 23%   38C    P8    17W / 250W |    811MiB / 12188MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Graphics Device     Off  | 00000000:03:00.0 Off |                  N/A |
| 34%   49C    P8    26W / 250W |      0MiB / 12058MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Graphics Device     Off  | 00000000:04:00.0 Off |                  N/A |
| 28%   40C    P8    24W / 250W |      0MiB / 12058MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
>>> torch.cuda.get_device_name(0)
'Graphics Device'
>>> torch.cuda.get_device_name(1)
'TITAN Xp'
>>> torch.cuda.get_device_name(2)
'Graphics Device'

I would have expected the device numbers to be consistent across these applications. If they are not, what should I expect when using another library such as Keras or TensorFlow?

Thanks in advance.


Hi,

The device numbering is consistent across all applications, except nvidia-smi, which ignores the CUDA_DEVICE_ORDER environment variable.
The problem is that the default device ordering is FASTEST_FIRST, while nvidia-smi always uses PCI_BUS_ID.
To make your applications consistent with nvidia-smi, add export CUDA_DEVICE_ORDER=PCI_BUS_ID to your .bashrc (or equivalent) so that every application uses nvidia-smi's ordering.
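If you can't (or don't want to) set the variable shell-wide, you can also set it from Python, as long as you do so before CUDA is initialized. A minimal sketch (the try/except is just so the snippet runs even where torch is not installed):

```python
import os

# CUDA reads CUDA_DEVICE_ORDER at initialization time, so set it
# before the first `import torch` (or any other CUDA-using import):
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

try:
    import torch

    # Device indices should now match the GPU column in nvidia-smi:
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            print(i, torch.cuda.get_device_name(i))
except ImportError:
    pass  # torch not installed; the env var is still set for this process
```

Note that setting os.environ after torch has already touched the GPU has no effect, which is why the .bashrc approach is the safer default.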


@albanD you are awesome! That’s exactly what I needed. I can’t tell you how much time I spent on this. I’m sure I missed it in the CUDA and/or PyTorch documentation. Thanks again.

Thanks, it helped me.

Echoing either $PCI_BUS_ID or $CUDA_DEVICE_ORDER gives me the same result: nothing. So I assume this won’t work in my case.

You are supposed to set the environment variable CUDA_DEVICE_ORDER to the value PCI_BUS_ID, as explained in the CUDA docs. (PCI_BUS_ID is a value, not a variable of its own, so echo $PCI_BUS_ID will always print nothing.)
echo $CUDA_DEVICE_ORDER will also not return anything until you have actually exported it.
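To check whether the variable actually reached your Python process (it must be exported before the interpreter starts, or set via os.environ before importing torch), a quick sketch:

```python
import os

# Prints the value if the variable is present in the process
# environment, or a placeholder if it was never exported:
print(os.environ.get("CUDA_DEVICE_ORDER", "<not set>"))
```

If this prints "<not set>" inside the process that runs your training code, the export in your shell configuration is not taking effect there.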
