Why do the two GPUs on my machine have the same ID, so that PyTorch can only choose one?

Hi there,

I know my question is not strictly about PyTorch, but I was trying to use PyTorch on these two GPUs, so I was wondering if anyone could help me out.

I have two GPUs on my machine: one is a Quadro K620 and the other a Quadro K2200. But for some strange reason, they have the same physical ID. See the outputs from lshw and nvidia-smi.

When I use torch.cuda.device_count(), it tells me that two devices are available, but I can only use the Quadro K2200, no matter how I set torch.cuda.device. I have also tested in TensorFlow, and I can only select gpu:0. If I try to select gpu:1, it reports that no device gpu:1 is available.
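
For reference, this is roughly how I have been trying to select the second card (just a sketch of my attempts; the index 1 here is whatever nvidia-smi reports for the K620):

import torch

print(torch.cuda.device_count())          # reports 2 on my machine

# Attempt 1: make device 1 the default CUDA device
torch.cuda.set_device(1)
print(torch.cuda.current_device())

# Attempt 2: put a tensor on device 1 explicitly
x = torch.randn(3, 3).cuda(1)
print(torch.cuda.get_device_name(1))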

So does anyone know how I can make the two GPUs have different IDs?

Thanks a lot for your help.
Shuokai

The output from your lshw command shows the SAME physical ID for both of the GPUs. Did you see that?

Run this code:

import sys
from subprocess import call

import torch

print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION')
# In a plain script, call() prints nvcc's output to the terminal;
# inside Jupyter you would use "! nvcc --version" instead.
call(["nvcc", "--version"])
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__Devices')
call(["nvidia-smi", "--format=csv",
      "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
print('Active CUDA Device: GPU', torch.cuda.current_device())

print('Available devices ', torch.cuda.device_count())
print('Current cuda device ', torch.cuda.current_device())

Hi Solomon K,

Thanks a lot for your reply. Yes, I do see that my two GPUs have the same physical ID, and this is exactly what I am confused about. Why are they assigned the same ID?

The output from your code is the following:
[screenshot: output of the diagnostic script, 2017-08-21]

Sorry, the part of your code that runs command-line commands does not work for me, and I have not figured out how to make it work, so I just ran those commands separately. As you can see, nvidia-smi labels the two GPUs differently, but lshw labels them as having the same ID. What could be wrong in this case?
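
For what it's worth, here is a rough sketch of how I could capture the command output from inside the script next time (assuming nvcc and nvidia-smi are on my PATH and I am on Python 3.5+):

from subprocess import run, PIPE

# Run the command-line tools and capture their output as text
for cmd in (["nvcc", "--version"],
            ["nvidia-smi", "--format=csv",
             "--query-gpu=index,name,driver_version,memory.total"]):
    result = run(cmd, stdout=PIPE, stderr=PIPE, universal_newlines=True)
    print(' '.join(cmd))
    print(result.stdout)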

Thanks,
Shuokai

Strange …
Run this and report the output:
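
For example, something along these lines (a minimal enumeration sketch; it only uses standard torch.cuda calls, nothing machine-specific):

import torch

# List every CUDA device torch can see, with its name and compute capability
n = torch.cuda.device_count()
print('torch sees', n, 'CUDA device(s)')
for i in range(n):
    print('  device', i, ':', torch.cuda.get_device_name(i),
          '| capability', torch.cuda.get_device_capability(i))
print('current device:', torch.cuda.current_device())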

Hi,

The output from your script is the following:

It seems to be similar to the outputs I got previously. Can you see if anything is wrong?

Cheers,
Shuokai

Reinstall the NVIDIA driver in the HOST OS.

When you swap a graphics card or physically remove or change anything, the driver does NOT update its device info automatically. You have to rerun the driver installer.
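
After the reinstall, a quick way to double-check what the driver itself reports is to query it directly (a sketch using the third-party nvidia-ml-py / pynvml bindings, which are not part of PyTorch and would need to be installed separately):

import pynvml

# Ask the NVIDIA driver directly, bypassing PyTorch and lshw
pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
print('driver reports', count, 'GPU(s)')
for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)       # may be a bytes object on older versions
    pci = pynvml.nvmlDeviceGetPciInfo(handle)
    print(' ', i, name, '| PCI bus id:', pci.busId)
pynvml.nvmlShutdown()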

Let me know if this solves your issue.

@FuriouslyCurious may be right.


Hi guys,

Thanks a lot for your help. Yes, indeed I recently added another GPU. I will reinstall the NVIDIA driver now. But could I ask how to do this in Ubuntu? The problem is that I also have CUDA 8.0 installed; will reinstalling the NVIDIA driver force me to reinstall CUDA 8.0? Please see the picture below for the NVIDIA packages on my machine.

I have searched around, but the answers I found vary. Some say I should use this command,

`sudo apt-get remove --purge nvidia*`

Some say I should use,

sudo nvidia-uninstall

To reinstall, I just use,

sudo add-apt-repository ppa:graphics-drivers/ppa

Is this the correct way? I assume my CUDA 8.0 installation will not be removed during the process?

Sorry, I am relatively new to Ubuntu and want to confirm before I do this.

Cheers

Hi guys,

I have reinstalled the NVIDIA driver following my previous post, but the problem remains. I can still only use one GPU, and lshw gives exactly the same info as before: both GPUs have the same physical ID, although nvidia-smi labels them differently.

I also rebooted my machine after reinstalling the NVIDIA driver. Strange…
What could I do then?
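
One more check I can try (a minimal sketch; assuming index 1 corresponds to the K620 in nvidia-smi's numbering) is to hide the K2200 via CUDA_VISIBLE_DEVICES before touching torch, and see whether the K620 is usable on its own:

import os

# Must be set before the first CUDA call, so do it before importing torch
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # assumption: "1" is the K620

import torch

print('visible devices:', torch.cuda.device_count())   # expect 1 if the K620 works
if torch.cuda.device_count() > 0:
    print('device 0 is now:', torch.cuda.get_device_name(0))
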
Cheers

Can you change their location on the motherboard?

There are only two available PCIe x16 slots, so should I just switch their positions?

Also, when I use the following command,

ubuntu-drivers devices

Here is what I got:

It seems Ubuntu can only find one GPU, which is the K2200 in this case.

@ShuokaiPan Don’t bother with NVidia repositories and apt-get. There is a better way.

Download the Ubuntu 16.04 drivers from the NVIDIA website here:
http://www.nvidia.com/object/unix.html
It is a “run” file, so you just run it with bash and it will do a full uninstall and reinstall of the driver.

Then download the NVIDIA CUDA Docker image here and run “nvidia-smi” inside the Docker image:
https://hub.docker.com/r/nvidia/cuda/tags/

Let us know what the output shows.

Hi @FuriouslyCurious

Thanks for the help. I will try that when I get a chance in the next few days. But what is the difference between installing CUDA natively on Ubuntu and using the CUDA Docker image?

Cheers

As I stated above, if you use nvidia-docker (as opposed to plain docker), you only need the drivers on the HOST machine.
Refer to my scripts above again.


If anyone else runs across this: I think the two cards are actually on separate PCI buses (pci@0000:03 and pci@0000:04), and each is device 0 on its own bus (03:00.0 and 04:00.0), which is why lshw reports the same “physical ID” for both. I might be wrong!