PyTorch is unable to load a model's parameters onto a particular GPU

I have logged in to a server that has 4 NVIDIA 1080 GPUs. I ran nvidia-smi and found that the global memory of GPU 0 is almost full, while the other GPUs have plenty of free global memory. The status is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25                 Driver Version: 390.25                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 68%   87C    P2   181W / 250W |  10752MiB / 11178MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 29%   63C    P8    22W / 250W |     25MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 20%   52C    P8    17W / 250W |     25MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 34%   67C    P2    63W / 250W |    536MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I am trying to run the following code snippet; the CNN is shallow, with 3 convolutional layers:

import torch

device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
cnn = CNN()  # my own model class with 3 convolutional layers
cnn = cnn.to(device)

It's clear that I want to run this on CUDA device 1, which has ample free global memory. But when I run it, the line cnn = cnn.to(device) fails with:

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCTensorRandom.cu:25

Why is this so? Can somebody help me? Thanks in advance.

Some details:
OS: Ubuntu Server 16.04
PyTorch version: 0.4.0
Python version: 3.5
Package manager: pip
CUDA version: 8.0.61
cuDNN version: 7.1.02

PyTorch will create a CUDA context on GPU 0 regardless of which GPUs you end up using; your traceback points at the RNG initialization (THCTensorRandom.cu), which runs when that context is created, and GPU 0 has no memory left. You should set CUDA_VISIBLE_DEVICES so that the process only ever sees the GPU you actually want.
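For example (a minimal sketch; train.py stands for whatever your script is called, and the tensor at the end is just a stand-in for your cnn.to(device) call), you can either launch the script as CUDA_VISIBLE_DEVICES=1 python train.py, or set the variable in Python before torch is imported:

import os

# Hide every GPU except physical GPU 1. This has to happen before the CUDA
# runtime is initialized, so the safest place is before importing torch.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import torch

# Inside this process, cuda:0 now maps to physical GPU 1.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Any CUDA work (e.g. cnn.to(device) from the question) now creates its
# context on the formerly idle GPU instead of the full GPU 0.
x = torch.randn(4, 3, 32, 32).to(device)
print(x.device)  # prints "cuda:0", which is physical GPU 1 here

Note that after this, cuda:1 no longer exists inside the process: device indices are renumbered over the visible GPUs only, so your code should address the remaining GPU as cuda:0.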

For a more detailed discussion, this issue is helpful: https://github.com/pytorch/pytorch/issues/3477
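As a quick sanity check (a sketch under the same assumption, i.e. that CUDA_VISIBLE_DEVICES=1 was set before PyTorch touched CUDA), you can confirm that the process sees exactly one device:

import torch

# With CUDA_VISIBLE_DEVICES=1 set, exactly one device should be visible,
# and index 0 should report the name of what nvidia-smi calls GPU 1.
print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))  # expected: a GeForce GTX 1080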