I wrote a network for semantic segmentation. When I try to train it on a 4-GPU machine I get an error. Is the reason that I must place my custom loss on each of the different cards? Has anyone had the same issue?
Here is the error traceback:
  File "train_densenet_model_2.py", line 290, in <module>
    main()
  File "train_densenet_model_2.py", line 138, in main
    train(train_loader, criterion, net, optimizer, curr_epoch, args, val_loader, visualize)
  File "train_densenet_model_2.py", line 185, in train
    card_1_main_loss = criterion(outputs[4:8], gts_slice_1)
  File "/home/zzx/ENV/conda_3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "train_densenet_model_2.py", line 68, in forward
    return self.nll_loss(F.log_softmax(inputs), targets)
  File "/home/zzx/ENV/conda_3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zzx/ENV/conda_3/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 147, in forward
    self.ignore_index, self.reduce)
  File "/home/zzx/ENV/conda_3/lib/python3.6/site-packages/torch/nn/functional.py", line 1051, in nll_loss
    return torch._C._nn.nll_loss2d(input, target, weight, size_average, ignore_index, reduce)
RuntimeError: Assertion `THCTensor(checkGPU)(state, 4, input, target, output, total_weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /opt/conda/conda-bld/pytorch_1512397735612/work/torch/lib/THCUNN/generic/SpatialClassNLLCriterion.cu:69
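The error says the loss kernel received tensors on different GPUs: the output slice `outputs[4:8]` lives on one card while the target slice (and any class-weight tensor of the criterion) lives on another. One way to guard against this, shown as a minimal sketch below, is to move the targets onto the same device as the inputs inside the loss's `forward`. The `CrossEntropyLoss2d` class here is a hypothetical reconstruction of the custom loss from the traceback (only its `log_softmax` + `NLLLoss` line is visible in the post), and the example runs on CPU for illustration; `.to(inputs.device)` behaves the same way across GPUs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossEntropyLoss2d(nn.Module):
    """Hypothetical 2D cross-entropy loss, reconstructed from the traceback."""

    def __init__(self, weight=None):
        super().__init__()
        # NLLLoss accepts (N, C, H, W) log-probabilities and (N, H, W) targets.
        self.nll_loss = nn.NLLLoss(weight)

    def forward(self, inputs, targets):
        # Move targets onto the same device as the inputs so nll_loss
        # never sees tensors on two different GPUs.
        targets = targets.to(inputs.device)
        return self.nll_loss(F.log_softmax(inputs, dim=1), targets)

criterion = CrossEntropyLoss2d()
outputs = torch.randn(4, 3, 8, 8)       # (N, C, H, W) logits for one slice
gts = torch.randint(0, 3, (4, 8, 8))    # (N, H, W) ground-truth class map
loss = criterion(outputs, gts)          # scalar loss on outputs' device
```

An alternative is to avoid slicing the batch by hand entirely and let `nn.DataParallel` scatter inputs and gather outputs, so the criterion only ever runs on one device.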