Why do the two GPUs on my machine have the same ID, so that PyTorch can only choose one?

Hi there,

I know my question is not strictly about PyTorch, but I was trying to use PyTorch on these two GPUs, so I was wondering if anyone could help me out.

I have two GPUs on my machine: one is a Quadro K620 and the other a Quadro K2200. But for some strange reason, they have the same physical ID. See the outputs from lshw and nvidia-smi.

When I use torch.cuda.device_count(), it tells me that two devices are available, but I can only use the Quadro K2200, no matter how I set torch.cuda.device. I have also tested in TensorFlow, and I can only select gpu:0. If I try to select gpu:1, it reports that no device gpu:1 is available.
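
For reference, this is roughly how I have been trying to select the second card (just a sketch of my attempts; the index 1 here is whatever nvidia-smi reports for the K620):

import torch

print(torch.cuda.device_count())          # reports 2 on my machine

# Attempt 1: make device 1 the default CUDA device
torch.cuda.set_device(1)
print(torch.cuda.current_device())

# Attempt 2: put a tensor on device 1 explicitly
x = torch.randn(3, 3).cuda(1)
print(torch.cuda.get_device_name(1))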

So does anyone know how I can make the two GPUs have different IDs?

Thanks a lot for your help.
Shuokai

The output from your lshw command shows the SAME physical ID for both of the GPUs. Did you see that?

Run this code:

import sys
from subprocess import call

import torch

print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION')
# In a plain script, call() prints nvcc's output to the terminal;
# inside Jupyter you would use "! nvcc --version" instead.
call(["nvcc", "--version"])
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__Devices')
call(["nvidia-smi", "--format=csv",
      "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
print('Active CUDA Device: GPU', torch.cuda.current_device())

print('Available devices ', torch.cuda.device_count())
print('Current cuda device ', torch.cuda.current_device())

Hi Solomon K,

Thanks a lot for your reply. Yes, I do see that my two GPUs have the same physical ID, and this is exactly what I am confused about. Why are they assigned the same ID?

The output from your code is the following:
[screenshot: output of the diagnostic script, 2017-08-21]

Sorry, the part of your code that runs command-line commands does not work for me, and I have not figured out how to make it work, so I just ran those commands separately. As you can see, nvidia-smi labels the two GPUs differently, but lshw labels them as having the same ID. What could be wrong in this case?
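
For what it's worth, here is a rough sketch of how I could capture the command output from inside the script next time (assuming nvcc and nvidia-smi are on my PATH and I am on Python 3.5+):

from subprocess import run, PIPE

# Run the command-line tools and capture their output as text
for cmd in (["nvcc", "--version"],
            ["nvidia-smi", "--format=csv",
             "--query-gpu=index,name,driver_version,memory.total"]):
    result = run(cmd, stdout=PIPE, stderr=PIPE, universal_newlines=True)
    print(' '.join(cmd))
    print(result.stdout)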

Thanks,
Shuokai

Strange …
Run this and report the output:
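
For example, something along these lines (a minimal enumeration sketch; it only uses standard torch.cuda calls, nothing machine-specific):

import torch

# List every CUDA device torch can see, with its name and compute capability
n = torch.cuda.device_count()
print('torch sees', n, 'CUDA device(s)')
for i in range(n):
    print('  device', i, ':', torch.cuda.get_device_name(i),
          '| capability', torch.cuda.get_device_capability(i))
print('current device:', torch.cuda.current_device())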

Hi,

The output from your script is the following:

It seems to be similar to the outputs I got previously. Can you see if anything is wrong?

Cheers,
Shuokai

Reinstall the NVIDIA driver in the HOST OS.

When you swap a graphics card or physically remove or change anything, the driver does NOT update its device info automatically. You have to rerun the driver installer.
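
After the reinstall, a quick way to double-check what the driver itself reports is to query it directly (a sketch using the third-party nvidia-ml-py / pynvml bindings, which are not part of PyTorch and would need to be installed separately):

import pynvml

# Ask the NVIDIA driver directly, bypassing PyTorch and lshw
pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
print('driver reports', count, 'GPU(s)')
for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)       # may be a bytes object on older versions
    pci = pynvml.nvmlDeviceGetPciInfo(handle)
    print(' ', i, name, '| PCI bus id:', pci.busId)
pynvml.nvmlShutdown()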

Let me know if this solves your issue.

@FuriouslyCurious may be right.


Hi guys,

Thanks a lot for your help. Yes, indeed I recently added another GPU. I will reinstall the NVIDIA driver now. But could I ask how to do this in Ubuntu? The problem is that I also have CUDA 8.0 installed; will reinstalling the NVIDIA driver force me to reinstall CUDA 8.0? Please see the picture below for the NVIDIA packages on my machine.

I have searched around, but the answers I found vary. Some say I should use this command,

`sudo apt-get remove --purge nvidia*`

Some say I should use,

sudo nvidia-uninstall

To reinstall, I just use,

sudo add-apt-repository ppa:graphics-drivers/ppa

Is this the correct way? I assume my CUDA 8.0 installation will not be removed during the process?

Sorry, I am relatively new to Ubuntu and want to confirm before I do this.

Cheers

Hi guys,

I have reinstalled the NVIDIA driver following my previous post, but the problem remains. I can still only use one GPU, and lshw gives exactly the same info as before: both GPUs have the same physical ID, although nvidia-smi labels them differently.

I also rebooted my machine after reinstalling the NVIDIA driver. Strange…
What could I do then?
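
One more check I can try (a minimal sketch; assuming index 1 corresponds to the K620 in nvidia-smi's numbering) is to hide the K2200 via CUDA_VISIBLE_DEVICES before touching torch, and see whether the K620 is usable on its own:

import os

# Must be set before the first CUDA call, so do it before importing torch
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # assumption: "1" is the K620

import torch

print('visible devices:', torch.cuda.device_count())   # expect 1 if the K620 works
if torch.cuda.device_count() > 0:
    print('device 0 is now:', torch.cuda.get_device_name(0))
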
Cheers

Can you change their location on the motherboard?

There are only two available PCIe x16 slots, so should I just switch their positions?

Also, when I use the following command,

ubuntu-drivers devices

Here is what I got:

It seems Ubuntu can only find one GPU, which is the K2200 in this case.

@ShuokaiPan Don’t bother with NVidia repositories and apt-get. There is a better way.

Download the Ubuntu 16.04 drivers from the NVIDIA website here:
http://www.nvidia.com/object/unix.html
It is a “run” file, so you just run it with bash and it will do a full uninstall and reinstall of the driver.

Then download the NVIDIA CUDA Docker image here and run “nvidia-smi” inside the Docker image:
https://hub.docker.com/r/nvidia/cuda/tags/

Let us know what the output shows.

Hi @FuriouslyCurious

Thanks for the help. I will try that when I get a chance in the next few days. But what is the difference between installing CUDA natively on Ubuntu and using the CUDA Docker image?

Cheers

As I stated above, if you use nvidia-docker (as opposed to plain docker), you only need the drivers on the HOST machine.
Refer to my scripts above again.


If anyone else runs across this: I think the two cards are actually on separate PCI buses (pci@0000:03 and pci@0000:04), and each is device 0 on its own bus (03:00.0 and 04:00.0), which is why lshw reports the same “physical ID” for both. I might be wrong!