I am training a CNN on top of the features from a pretrained network for semantic segmentation. The two models live on different GPUs, and I copy the data to the respective GPUs as shown in the following snippet.
```python
# move pretrained model from CPU to GPU 1
backbone = backbone.cuda(1)

# move the current model to GPU 2
model = model.cuda(2)

# NLL criterion
criterion = nn.NLLLoss2d().cuda(2)

# in the training loop
x = Variable(t_rgb[i * 4 : i * 4 + cbs].type(dtype).cuda(1), requires_grad=False)
y = Variable(t_target[i * 4 : i * 4 + cbs].type(th.LongTensor).cuda(2), requires_grad=False)

x = backbone.forward(x).type(dtype).cuda(2)
output = model(x)

print('x ', x.get_device())
print('output ', output.get_device())
print('y ', y.get_device())

loss = criterion(output, y)
print('loss ', loss.get_device())

optimizer.zero_grad()
loss.backward()
```
The device-id prints all report GPU 2, but then I get
```
RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, input, gradInput, gradOutput)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /b/wheel/pytorch-src/torch/lib/THCUNN/generic/Threshold.cu:49
```
The exact stacktrace is as follows.
```
  File "training.py", line 319, in <module>
    loss.backward()
  File "/opt/python3.5/lib/python3.5/site-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
  File "/opt/python3.5/lib/python3.5/site-packages/torch/nn/_functions/thnn/auto.py", line 175, in backward
    update_grad_input_fn(self._backend.library_state, input, grad_output, grad_input, *gi_args)
RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, input, gradInput, gradOutput)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /b/wheel/pytorch-src/torch/lib/THCUNN/generic/Threshold.cu:49
```
From what I understand, both the input to the model (the Variable x) and the model itself (and hence its weights and gradients) are on the same GPU (2 in my case), so I don't understand why I am getting this error. Could someone please help me figure out the issue? I would also welcome any suggestions on keeping models on different GPUs and moving data between them.
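One pattern I have seen for a frozen feature extractor feeding a trainable head is to detach the features at the device boundary, so that backward() never tries to traverse a graph that spans two GPUs. Below is a minimal sketch of that idea; the small Conv2d modules are hypothetical stand-ins for the real backbone and segmentation head, and it runs on CPU for simplicity (on real hardware you would add the `.cuda(1)` / `.cuda(2)` calls at the indicated points):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the pretrained backbone and the trainable head
backbone = nn.Conv2d(3, 8, 3, padding=1)   # would live on GPU 1
head = nn.Conv2d(8, 5, 1)                  # would live on GPU 2
criterion = nn.NLLLoss()

x = torch.randn(2, 3, 16, 16)              # dummy RGB batch
y = torch.randint(0, 5, (2, 16, 16))       # dummy segmentation targets

# Detach at the boundary: the copy to the head's GPU carries no autograd
# history, so backward() stays entirely on the head's device.
feats = backbone(x).detach()
# feats = feats.cuda(2)                    # on real hardware, move to GPU 2 here

out = nn.functional.log_softmax(head(feats), dim=1)
loss = criterion(out, y)
loss.backward()

assert backbone.weight.grad is None        # frozen backbone receives no gradients
assert head.weight.grad is not None        # head is trained as usual
```

Since the backbone is pretrained and not being trained here, cutting the graph with `.detach()` loses nothing and removes any cross-GPU edges from the backward pass.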