Loss.backward() raises error 'grad can be implicitly created only for scalar outputs'


loss.backward() does not go through during training and throws the following error when running on multiple GPUs with torch.nn.DataParallel:

grad can be implicitly created only for scalar outputs

But the same code trains fine when I pass only device_ids=[0] to torch.nn.DataParallel.
Is there something I am missing here?


When running on two GPUs, the loss function returns a vector of 2 loss values. If I run backward only on the first element of the vector, it works fine.

How can I make the backward function work with a vector containing two or more loss values?



When you call loss.backward(), it is a shortcut for loss.backward(torch.Tensor([1])). This is only valid if loss is a tensor containing a single element.
DataParallel returns the partial loss that was computed on each GPU, so you usually want to do loss.backward(torch.Tensor([1, 1])) or loss.sum().backward(). Both have exactly the same behaviour.
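To make the equivalence concrete, here is a minimal sketch that runs on CPU. The two-element tensor only stands in for the per-GPU loss vector that DataParallel gathers; the parameter w and its values are made up for illustration. It uses torch.ones_like(loss) rather than torch.Tensor([1, 1]) so that the gradient argument always matches the shape and device of the loss.

```python
import torch

# Stand-in for the vector DataParallel gathers: one partial loss per GPU.
# (Simulated on CPU; `w` and the two-element shape are illustrative assumptions.)
w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = w ** 2  # shape (2,), so it is not a scalar

# Calling backward() on a non-scalar without a gradient argument fails.
try:
    loss.backward()
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

# Option 1: pass an explicit gradient of ones with the same shape as the loss.
# retain_graph=True keeps the graph alive so backward can be run a second time.
loss.backward(torch.ones_like(loss), retain_graph=True)
grad_explicit = w.grad.clone()

# Option 2: reduce the vector to a scalar first, then call backward().
w.grad = None
loss.sum().backward()
grad_sum = w.grad.clone()

print(error_message)
print(grad_explicit, grad_sum)
```

Both options leave the same gradient in w.grad, which is why either one can be used in the training loop.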


If I want the average loss per sample, and each element of loss has already been averaged over its batch, should I use loss.mean().backward()?

If every GPU receives a batch of the same size, it will work.
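The reason the batch sizes matter can be checked with a small CPU-only sketch. The per-sample losses and the helper mean_of_chunk_means are hypothetical; each chunk plays the role of the part of the batch that one GPU sees.

```python
import torch

# Hypothetical per-sample losses for a batch of six samples.
per_sample = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

def mean_of_chunk_means(losses, sizes):
    # Each chunk is averaged on its own (as each GPU would do),
    # then the per-chunk averages are averaged again.
    chunks = torch.split(losses, sizes)
    return torch.stack([c.mean() for c in chunks]).mean()

true_mean = per_sample.mean()
equal = mean_of_chunk_means(per_sample, [3, 3])    # same size on both "GPUs"
unequal = mean_of_chunk_means(per_sample, [4, 2])  # last batch split unevenly

print(true_mean.item(), equal.item(), unequal.item())
```

With equal chunks the mean of the means matches the true per-sample mean. With unequal chunks the samples in the smaller chunk are weighted more heavily, so the result drifts away from the true average.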


I got it. Thank you : )

When I try loss.mean().backward() or loss.sum().backward(), I get this warning: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector. warnings.warn('Was asked to gather along dimension 0, but all ’

How do I suppress this one?

Let’s continue the discussion in the topic you’ve created and have a look at @SimonW’s answer.
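For reference, and independently of the answer in the linked topic, one generic way to hide a specific warning is a filter from Python's standard warnings module. The sketch below is only an illustration: the helper count_visible is hypothetical and emits the warning by hand so that it runs without any GPU.

```python
import warnings

# Start of the message to match; filterwarnings treats it as a regex
# that must match the beginning of the warning text.
GATHER_WARNING = "Was asked to gather along dimension 0"

def count_visible(filter_it):
    # Record warnings inside this block instead of printing them.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        if filter_it:
            # Added after simplefilter, so this rule is checked first.
            warnings.filterwarnings(
                "ignore", message=GATHER_WARNING, category=UserWarning
            )
        # Emitted by hand here; in training it would come from the gather step.
        warnings.warn(
            "Was asked to gather along dimension 0, but all input tensors "
            "were scalars; will instead unsqueeze and return a vector.",
            UserWarning,
        )
    return len(caught)

print(count_visible(False), count_visible(True))
```

In a training script, the single warnings.filterwarnings("ignore", message=..., category=UserWarning) call placed before the training loop would be the relevant part. It only hides the message and does not change what is gathered.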