Calculating the loss in the model's forward with multi-GPU training returns a tuple of losses

Alva-2020 · July 22, 2020, 6:30pm

Hi everyone, I call F.nn_loss() inside the model's forward, as above. I then train on two GPUs with model = torch.nn.DataParallel(model).cuda(), and speed_loss comes back as a tuple. Before calling loss.backward(), I use torch.sum(speed_loss) to get a scalar. Is this right? The tuple comes from the calculations on the two GPUs; I add the values together and then call backward directly.

ptrblck · July 25, 2020, 3:02am

The output of nn.DataParallel should be a single tensor on the default device, so the target should be on that device as well.
I'm not sure if you are using a custom data parallel approach, but if you are calculating the loss from the model output and the target, you should get a tensor (also on the default device) and not a tuple.
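
For reference, a minimal sketch of the setup described here, where forward() returns only the predictions and the loss is computed outside the wrapped model. The toy network, tensor shapes, and the choice of F.nll_loss are assumptions for illustration, not the original poster's code. It falls back to the CPU when no GPU is available, in which case nn.DataParallel simply calls the module directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical toy classifier; forward() returns only log-probabilities.
net = nn.Sequential(nn.Linear(8, 4), nn.LogSoftmax(dim=1))
net = nn.DataParallel(net).to(device)

x = torch.randn(16, 8, device=device)
target = torch.randint(0, 4, (16,), device=device)

# The per-GPU outputs are gathered into one tensor on the default device.
output = net(x)

# Loss computed outside the model: a single 0-dim tensor, nothing to reduce.
loss = F.nll_loss(output, target)
loss.backward()
```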

Alva-2020 · August 11, 2020, 2:43am

Thanks very much, I fixed it.

gangqiang_hu · November 6, 2020, 2:17pm

I have the same problem: I get multiple losses. How did you solve that? Thanks a lot!

gangqiang_hu · November 7, 2020, 2:17am

OK…I fixed that. If you compute and return the loss inside the model, you get back one loss per GPU as a list. Just average that list before calling backward.
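
A minimal sketch of that fix, assuming a hypothetical ModelWithLoss module whose forward() returns the loss (the class name, layer sizes, and the use of F.mse_loss are made up for illustration). With several GPUs, nn.DataParallel gathers one loss value per replica; on a single GPU or on the CPU it calls the module directly and the loss is already a scalar, so .mean() is harmless in either case.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModelWithLoss(nn.Module):
    """Hypothetical model that computes its loss inside forward()."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 1)

    def forward(self, x, target):
        pred = self.fc(x).squeeze(-1)
        # Each replica only sees its own slice of the batch.
        return F.mse_loss(pred, target)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(ModelWithLoss()).to(device)

x = torch.randn(16, 8, device=device)
target = torch.randn(16, device=device)

# One value per GPU when several GPUs are used, a 0-dim tensor otherwise.
per_replica_loss = model(x, target)

# Reduce to a scalar before calling backward.
loss = per_replica_loss.mean()
loss.backward()
```

Using torch.sum instead of the mean also gives a scalar, but the loss and its gradients are then larger by a factor equal to the number of replicas, which acts like a larger learning rate.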