Error: Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one

I wrote a network for semantic segmentation. When I try to train it on a 4-GPU machine I get the error below. Does this mean I must place my custom loss on each card separately? Has anyone else run into this issue?

Here is the error traceback:

File "", line 290, in
File "", line 138, in main
    train(train_loader, criterion, net, optimizer, curr_epoch, args, val_loader, visualize)
File "", line 185, in train
    card_1_main_loss = criterion(outputs[4:8], gts_slice_1)
File "/home/zzx/ENV/conda_3/lib/python3.6/site-packages/torch/nn/modules/", line 325, in __call__
    result = self.forward(*input, **kwargs)
File "", line 68, in forward
    return self.nll_loss(F.log_softmax(inputs), targets)
File "/home/zzx/ENV/conda_3/lib/python3.6/site-packages/torch/nn/modules/", line 325, in __call__
    result = self.forward(*input, **kwargs)
File "/home/zzx/ENV/conda_3/lib/python3.6/site-packages/torch/nn/modules/", line 147, in forward
    self.ignore_index, self.reduce)
File "/home/zzx/ENV/conda_3/lib/python3.6/site-packages/torch/nn/", line 1051, in nll_loss
    return torch._C._nn.nll_loss2d(input, target, weight, size_average, ignore_index, reduce)
RuntimeError: Assertion `THCTensor_(checkGPU)(state, 4, input, target, output, total_weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /opt/conda/conda-bld/pytorch_1512397735612/work/torch/lib/THCUNN/generic/


Here is part of my training code; my network has two aux branches.

        for inputs_slice, gts_slice in zip(inputs, gts):

            inputs_slice = Var(inputs_slice).cuda()
            # print(gts_slice.size(), gts_slice[0].size())
            gts_slice_0 = Var(gts_slice[:4]).cuda(device=0)
            gts_slice_1 = Var(gts_slice[4:8]).cuda(device=1)
            gts_slice_2 = Var(gts_slice[8:12]).cuda(device=2)
            gts_slice_3 = Var(gts_slice[12:]).cuda(device=3)

            outputs, aux_1, aux_2 = net(inputs_slice)
            print(type(outputs), len(outputs), type(outputs[0]), outputs[:4].size(), gts_slice_0.size())

            assert outputs[0].size()[2:] == gts_slice_0.size()[2:]
            assert outputs[0].size()[0] == NUM_CLASSES

            card_0_main_loss = criterion(outputs[:4], gts_slice_0)
            card_0_aux_1_loss = criterion(aux_1[:4], gts_slice_0)
            card_0_aux_2_loss = criterion(aux_2[:4], gts_slice_0)
            card_0_loss = card_0_main_loss + 0.3 * card_0_aux_2_loss + 0.15 * card_0_aux_1_loss

            card_1_main_loss = criterion(outputs[4:8], gts_slice_1)
            card_1_aux_1_loss = criterion(aux_1[4:8], gts_slice_1)
            card_1_aux_2_loss = criterion(aux_2[4:8], gts_slice_1)
            card_1_loss = card_1_main_loss + 0.3 * card_1_aux_2_loss + 0.15 * card_1_aux_1_loss

            card_2_main_loss = criterion(outputs[8:12], gts_slice_2)
            card_2_aux_1_loss = criterion(aux_1[8:12], gts_slice_2)
            card_2_aux_2_loss = criterion(aux_2[8:12], gts_slice_2)
            card_2_loss = card_2_main_loss + 0.3 * card_2_aux_2_loss + 0.15 * card_2_aux_1_loss

            card_3_main_loss = criterion(outputs[12:], gts_slice_3)
            card_3_aux_1_loss = criterion(aux_1[12:], gts_slice_3)
            card_3_aux_2_loss = criterion(aux_2[12:], gts_slice_3)
            card_3_loss = card_3_main_loss + 0.3 * card_3_aux_2_loss + 0.15 * card_3_aux_1_loss


Your outputs tensor and gts_slice_1 tensor are not on the same GPU when you calculate criterion(outputs[4:8], gts_slice_1).
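A minimal sketch of the fix, assuming the outputs have already been gathered onto one device (which is what nn.DataParallel does by default, typically onto GPU 0): move every target slice to the device of the gathered outputs before calling the criterion, instead of scattering the targets across cards. The tensor shapes here are made up for illustration.

```python
import torch
import torch.nn as nn

criterion = nn.NLLLoss()

# Stand-ins for the gathered network outputs and the ground-truth labels;
# in the real training loop these come from net(inputs_slice) and the loader.
outputs = torch.randn(16, 21, 8, 8).log_softmax(dim=1)  # (N, C, H, W) log-probs
gts = torch.randint(0, 21, (16, 8, 8))                  # (N, H, W) class indices

# Key step: put the targets on the same device as the outputs,
# rather than on .cuda(device=0..3) per slice.
device = outputs.device
gts = gts.to(device)

loss = criterion(outputs[4:8], gts[4:8])  # all operands now share one device
```

With the targets on the outputs' device, each sliced loss term can be computed and summed without the checkGPU assertion firing.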

I think you are right.