Loss.backward() raises error 'grad can be implicitly created only for scalar outputs'


loss.backward() does not go through during training and throws an error when running on multiple GPUs using torch.nn.DataParallel:

grad can be implicitly created only for scalar outputs

But the same thing trains fine when I give only device_ids=[0] to torch.nn.DataParallel.
Is there something I am missing here?


While running on two GPUs, the loss function returns a vector of 2 loss values. If I run the backward only on the first element of the vector, it goes fine.

How can I make the backward function work with a vector containing two or more loss values?



When you do loss.backward(), it is a shortcut for loss.backward(torch.Tensor([1])). This is only valid if loss is a tensor containing a single element.
DataParallel returns the partial loss that was computed on each GPU, so you usually want to do loss.backward(torch.Tensor([1, 1])) or loss.sum().backward(). Both will have the exact same behaviour.
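A small CPU-only sketch of the same idea, using a hand-made 2-element tensor to stand in for the per-GPU partial losses (the actual DataParallel outputs would of course differ):

```python
import torch

# Stand-in for the vector of partial losses DataParallel returns,
# one entry per GPU; the values here are illustrative.
w = torch.tensor([2.0, 3.0], requires_grad=True)
loss = w * 1.0  # a 2-element "loss" vector, not a scalar

# loss.backward() alone would raise:
#   "grad can be implicitly created only for scalar outputs"
# Either pass an explicit gradient vector...
loss.backward(torch.ones_like(loss))
# ...or reduce first: loss.sum().backward() gives the same result.

print(w.grad)  # tensor([1., 1.])
```

Passing `torch.ones_like(loss)` and calling `.sum().backward()` are equivalent because summing just weights every element of the output by 1.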


If I want to get the average loss per sample, and every element in loss has already been averaged over its sub-batch, should I use loss.mean().backward()?

If all the sub-batches are the same size, it will work.


I got it. Thank you : )

When I try loss.mean().backward() or loss.sum().backward() I am getting this warning: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.

How do I suppress this one?


Let’s continue the discussion in the topic you’ve created and have a look at @SimonW’s answer.


Why is backward limited in this way? Is it an oversight or some important reason?



The gradient is computed as a vector-Jacobian product, so the size of the vector has to match the size of a dimension of the Jacobian (which is the size of the output).
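A concrete illustration of the vector-Jacobian product (the function y = x² is just an example with an easy, diagonal Jacobian):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # 3-element output, so the Jacobian J = diag(2x) is 3x3

# The vector passed to backward must match y's shape; backward then
# computes v^T @ J rather than the full Jacobian.
v = torch.tensor([1.0, 1.0, 1.0])
y.backward(v)

print(x.grad)  # tensor([2., 4., 6.]) == v^T @ diag([2., 4., 6.])
```

With v all ones this is the gradient of y.sum(); a different v reweights each output's contribution.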


Sure, the number of grads needs to equal the number of variables.

What I meant was that it seems weird that the “backward” function is defined on a single variable unless otherwise stated, even though it is implemented on vectors with more than one variable.

e.g. if I have vec, a 2-element tensor of 2 variables, and call vec.backward(), it won’t work, but if vec is a 1-element tensor it will. I can’t see an obvious reason why backward should by default be limited to 1 variable (unless explicitly told otherwise), especially seeing as it is a method of the variable.

Is there a reason for this limitation?

You can call it with more elements, it’s just that you have to specify the grads yourself.

Can you write an example?

For example, if you want to compute the gradient for the sum of the elements in x. You can do either:
x.sum().backward() or x.backward(torch.ones_like(x)).
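Spelled out as a runnable check that the two forms give the same gradient:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Option 1: reduce to a scalar, then backward with no argument.
x.sum().backward()
g1 = x.grad.clone()

# Option 2: keep x as-is and supply the gradient vector explicitly.
x.grad = None  # clear the accumulated gradient first
x.backward(torch.ones_like(x))
g2 = x.grad

print(torch.equal(g1, g2))  # True: both are tensor([1., 1., 1.])
```

Note that gradients accumulate across backward calls, which is why x.grad is reset to None between the two options.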