I know my question is not strictly related to PyTorch, but I was trying to use PyTorch with these two GPUs, so I was wondering if anyone could help me out.
I have two GPUs in my machine, a Quadro K620 and a Quadro K2200. But for some strange reason, they have the same physical ID. See the outputs from lshw and nvidia-smi.
When I use torch.cuda.device_count(), it tells me that two devices are available. But I can only use the Quadro K2200, no matter how I set torch.cuda.device. I have also tested in TensorFlow, where I can only select gpu:0; if I try to select gpu:1, it reports that no device gpu:1 is available.
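Here is a minimal sketch of the kind of test I ran (the try/except is just for illustration):

```python
import torch

print(torch.cuda.device_count())  # prints 2 on my machine

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
    try:
        with torch.cuda.device(i):
            x = torch.cuda.FloatTensor(4)  # small allocation on device i
        print('cuda:%d works' % i)
    except RuntimeError as e:
        print('cuda:%d failed: %s' % (i, e))
```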
So does anyone know how I can make the two GPUs have different IDs?
Thanks a lot for your reply. Yes, I do see that my two GPUs have the same physical ID, and this is exactly what I am confused about. Why are they assigned the same ID?
Sorry, the part of your code that runs command-line commands does not work, and I have not figured out how to make it work, so I just ran the commands separately. As you can see, nvidia-smi labels the two cards differently, but lshw labels them as having the same ID. What could be wrong in this case?
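In case anyone wants to reproduce this from Python, here is a minimal sketch of how the two tools could be run together (the exact flags are my assumption; lshw may need sudo for full details):

```python
import subprocess

# NVIDIA's view of the cards (one line per GPU)
print(subprocess.check_output(['nvidia-smi', '-L']).decode())

# The PCI view of the cards, where both show the same physical ID
print(subprocess.check_output(['lshw', '-C', 'display']).decode())
```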
When you swap graphics cards or physically add or remove anything, the driver does NOT update the device info automatically. You have to rerun the driver installer.
Thanks a lot for your help. Yes, indeed I recently added a second GPU. I will reinstall the NVIDIA driver now. But could I ask how to do this on Ubuntu? The problem is that I also have CUDA 8.0 installed; will reinstalling the NVIDIA driver force me to reinstall CUDA 8.0? Please see the pic below for the NVIDIA packages on my machine.
I have reinstalled the NVIDIA driver following my previous post, but the problem remains. I can still only use one GPU, and lshw gives exactly the same info as before: both GPUs have the same physical ID, although nvidia-smi labels them differently.
I also rebooted my machine after reinstalling the NVIDIA driver. Strange…
What could I do then?
Cheers
@ShuokaiPan Don’t bother with the NVIDIA repositories and apt-get. There is a better way.
Download the Ubuntu 16.04 drivers from the NVIDIA website here: http://www.nvidia.com/object/unix.html
It is a “run” file, so you just run it with bash and it will do a full uninstall and reinstall of the driver.
Thanks for the help. I will try that when I get a chance in the next few days. But what is the difference between installing CUDA natively on Ubuntu and installing the CUDA Docker image?
As I stated above, if you use nvidia-docker (as opposed to plain docker), you only need the drivers on the HOST machine.
Refer to my scripts above again.
If anyone else runs across this: I think the two cards are on separate PCI buses (pci@0000:03 and pci@0000:04), and each is device 0 on its own bus (03:00.0 and 04:00.0), which is why lshw shows the same physical ID for both. I might be wrong!
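If that is the case, the index CUDA uses is just an enumeration order, not the lshw physical ID. One thing worth trying is pinning the enumeration to PCI bus order so it matches nvidia-smi (a sketch; CUDA_DEVICE_ORDER defaults to FASTEST_FIRST, which can make cuda:0 the faster K2200 regardless of slot):

```python
import os

# These must be set before the first CUDA call (in practice, before
# importing torch in most scripts).
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'   # index by bus, like nvidia-smi
# os.environ['CUDA_VISIBLE_DEVICES'] = '1'       # e.g. expose only the second card

import torch

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```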