I am training a CNN on top of the features of a pretrained network for semantic segmentation. The two models sit on different GPUs, and I copy the data to the respective GPUs as shown in the following snippet.
# move pretrained model from CPU to GPU 1
backbone = backbone.cuda(1)
# move the current model to GPU 2
model = model.cuda(2)
# NLL criterion
criterion = nn.NLLLoss2d().cuda(2)
# in the training loop
x = Variable(t_rgb[i * 4 : i * 4 + cbs].type(dtype).cuda(1), requires_grad=False)
y = Variable(t_target[i * 4 : i * 4 + cbs].type(th.LongTensor).cuda(2), requires_grad=False)
x = backbone(x).type(dtype).cuda(2)
output = model(x)
print('x', x.get_device())
print('output', output.get_device())
print('y', y.get_device())
loss = criterion(output, y)
print('loss', loss.get_device())
optimizer.zero_grad()
loss.backward()
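As an extra sanity check on placement, a small helper like the following (hypothetical, not part of my script above; written so it also runs on CPU, where `get_device()` is unavailable) reports every device a module's parameters live on:

```python
import torch.nn as nn

def param_devices(module):
    # Return the set of devices holding this module's parameters
    # (-1 stands for CPU; CUDA tensors report their device index).
    return {p.data.get_device() if p.data.is_cuda else -1
            for p in module.parameters()}

net = nn.Linear(4, 2)  # hypothetical stand-in for `model`
print(param_devices(net))  # → {-1} on CPU
```

In my setup I would expect `param_devices(backbone)` to be `{1}` and `param_devices(model)` to be `{2}`.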
The output I get is,
x 2
output 2
y 2
loss 2
But then I get:
RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, input, gradInput, gradOutput)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /b/wheel/pytorch-src/torch/lib/THCUNN/generic/Threshold.cu:49
The exact stacktrace is as follows.
File "training.py", line 319, in <module>
loss.backward()
File "/opt/python3.5/lib/python3.5/site-packages/torch/autograd/variable.py", line 146, in backward
self._execution_engine.run_backward((self,), (gradient,), retain_variables)
File "/opt/python3.5/lib/python3.5/site-packages/torch/nn/_functions/thnn/auto.py", line 175, in backward
update_grad_input_fn(self._backend.library_state, input, grad_output, grad_input, *gi_args)
RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, input, gradInput, gradOutput)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /b/wheel/pytorch-src/torch/lib/THCUNN/generic/Threshold.cu:49
From what I understand, the input to the model (Variable x) and the model itself (and hence its weights and gradients) are both on the same GPU (2 in my case), so I don't understand why I am getting this error. Could someone please help me figure out the issue? I also welcome any general suggestions on keeping models on different GPUs and moving data between them.
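For reference, here is a minimal, self-contained sketch of the handoff I am attempting, with `.detach()` used to cut the autograd graph at the boundary between the two models, so that `backward()` never has to cross back into the first one. This is CPU-only so it runs anywhere, and `net1`/`net2` are hypothetical stand-ins for `backbone`/`model`; I am not sure whether detaching is the right approach here.

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

# Hypothetical stand-ins for the two models; in my real setup
# net1 would live on GPU 1 and net2 on GPU 2.
net1 = nn.Linear(8, 8)   # plays the role of `backbone`
net2 = nn.Linear(8, 4)   # plays the role of `model`

x = Variable(torch.randn(2, 8), requires_grad=False)
feats = net1(x)
# Hand the features over; .detach() severs the graph, so the backward
# pass stops at the boundary instead of flowing back into net1.
feats = feats.detach()
out = net2(feats)
loss = out.sum()
loss.backward()  # only net2's parameters receive gradients
```

Since I set `requires_grad=False` on the input and the backbone is frozen anyway, cutting the graph like this seems like it should be safe, but I would appreciate confirmation.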