Should I call loss.backward() or loss.mean().backward()?

Can someone please clarify the difference between loss.backward() and loss.mean().backward()? Confusion overload :confused:
The loss is computed as loss = criterion(output, label)

When should I call loss.backward(), and in what scenario should I call loss.mean().backward() instead?
Does it have anything to do with the batch size or the number of GPUs being used?

Can it cause a problem, or is it wrong, if I carelessly use one instead of the other?

Thanks :slight_smile:


If your loss is not a scalar, you must reduce it to one with loss.mean() or loss.sum() before calling backward(). Otherwise, it will raise an error, as in the following example:

>>> loss = criterion(m(input), target)
>>> loss
tensor([1.2377, 0.6949, 0.6477], grad_fn=<BinaryCrossEntropyBackward>)
>>> loss.backward()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vahid/anaconda3/envs/py37/lib/python3.7/site-packages/torch/", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/vahid/anaconda3/envs/py37/lib/python3.7/site-packages/torch/autograd/", line 84, in backward
    grad_tensors = _make_grads(tensors, grad_tensors)
  File "/home/vahid/anaconda3/envs/py37/lib/python3.7/site-packages/torch/autograd/", line 28, in _make_grads
    raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs
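
To make the difference concrete, here is a minimal sketch that reproduces the error and then reduces the loss both ways. The tiny linear model, the input, and the targets are hypothetical stand-ins, and nn.BCEWithLogitsLoss with reduction="none" is an assumption chosen only to get one loss value per sample.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)  # hypothetical stand-in for the real model
criterion = nn.BCEWithLogitsLoss(reduction="none")  # one loss value per sample

x = torch.randn(3, 4)
target = torch.tensor([[1.0], [0.0], [1.0]])

loss = criterion(model(x), target)  # shape (3, 1), not a scalar

# backward() with no arguments refuses non-scalar tensors.
try:
    loss.backward()
    raised = False
except RuntimeError:
    raised = True

# Reducing with mean() gives a scalar that backward() accepts.
loss.mean().backward()
grad_mean = model.weight.grad.clone()

# Reducing with sum() also works, but the gradient is scaled differently.
model.zero_grad()
loss = criterion(model(x), target)
loss.sum().backward()
grad_sum = model.weight.grad.clone()
```

Both reductions are valid, but they are not interchangeable: with N loss elements, the gradient from sum() is N times the gradient from mean(), so swapping one for the other effectively changes the learning rate.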

As far as I know, most PyTorch loss functions return a scalar by default unless you manually change the reduction parameter (called reduce in older versions).


Why do you want to add .mean() to the loss? The loss functions in torch.nn and torch.nn.functional already apply a mean reduction by default. For example, if you use torch.nn.CrossEntropyLoss() as your loss function and pass it two tensors, your model’s output and the corresponding target, the loss is already a tensor with a single element, like this:

tensor(1.6975, device='cuda:1', grad_fn=<NllLoss2DBackward>)
... ...

And you can do backpropagation in your network normally.
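
As a quick check of this point, the sketch below (with made-up logits and targets) shows that the default nn.CrossEntropyLoss() already returns a 0-dimensional tensor, and that calling .mean() on it changes nothing about the gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5, 3, requires_grad=True)  # hypothetical model output
target = torch.tensor([0, 2, 1, 1, 0])

criterion = nn.CrossEntropyLoss()  # default reduction is "mean"
loss = criterion(logits, target)
is_scalar = loss.dim() == 0  # already a single-element tensor

loss.backward()
grad_plain = logits.grad.clone()

# The mean of a scalar is the scalar itself, so the gradient is identical.
logits.grad = None
criterion(logits, target).mean().backward()
grad_mean = logits.grad.clone()

same = torch.allclose(grad_plain, grad_mean)
```

So adding .mean() to an already reduced loss is harmless but redundant.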

Yes, that’s true. The important thing is that calling loss.backward() without arguments only works on scalar values. Setting reduction='mean' or reduction='sum' takes care of this. But sometimes you may need reduction='none' in order to apply a different mean/sum along different dimensions, for example taking the mean over the batch but the sum over pixels.
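
The "mean over the batch, sum over pixels" case could look roughly like this. The tensor shapes and the choice of nn.MSELoss are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical batch of 2 single-channel 4x4 "images".
pred = torch.randn(2, 1, 4, 4, requires_grad=True)
target = torch.randn(2, 1, 4, 4)

# reduction="none" keeps one loss value per pixel.
per_pixel = nn.MSELoss(reduction="none")(pred, target)  # shape (2, 1, 4, 4)

# Sum over channel and spatial dimensions, then average over the batch.
per_image = per_pixel.sum(dim=(1, 2, 3))  # shape (2,)
loss = per_image.mean()                   # scalar, safe for backward()

loss.backward()
```

Because the final result is a scalar, backward() works as usual; only the weighting of each pixel's contribution differs from the built-in 'mean' and 'sum' reductions.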


Yes, I agree, but all of this holds for the single-GPU case.
What about when we are using multiple GPUs: should we stick to loss.backward() if our loss is a scalar, or should we use loss.mean().backward()?


It does not matter how many GPUs you use. If the loss is already a scalar, you can just call loss.backward(); if it is not a scalar, you can reduce it to one and then call backward: loss.mean().backward(). The number of GPUs does not play any role in how you call the backward function.


Just FYI, we can also use .mean().backward() whenever we define the loss inside our nn.Module.
This is usually done when we have multiple GPUs and don't want one of them to be unbalanced in terms of memory.
More can be found here


Oh, interesting! So if the loss is defined inside the model, then it can have independent values across the different GPUs, and in that case we need to add them up (or take the mean) to convert them to a scalar value before calling .backward().
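
A rough sketch of that setup, assuming the multi-GPU wrapper is nn.DataParallel (the thread does not name it) and using a hypothetical ModelWithLoss module. With nn.DataParallel, each replica returns its own scalar loss and these are gathered into a 1-D tensor with one entry per GPU, which is why the extra .mean() is needed. On a machine with fewer than two GPUs the sketch falls back to the plain module, where .mean() is a no-op.

```python
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Hypothetical wrapper that computes the loss inside forward()."""

    def __init__(self):
        super().__init__()
        self.net = nn.Linear(8, 3)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x, target):
        # Each replica computes the loss on its own slice of the batch.
        return self.criterion(self.net(x), target)

multi_gpu = torch.cuda.device_count() > 1
device = torch.device("cuda" if multi_gpu else "cpu")

model = ModelWithLoss().to(device)
if multi_gpu:
    # Assumption: nn.DataParallel is the multi-GPU mechanism in use.
    model = nn.DataParallel(model)

x = torch.randn(16, 8, device=device)
target = torch.randint(0, 3, (16,), device=device)

loss = model(x, target)  # one value per GPU under DataParallel, else a scalar
loss.mean().backward()   # works in both cases

inner = model.module if multi_gpu else model
```

Writing loss.mean().backward() keeps the same training script working whether it runs on one GPU, several GPUs, or the CPU.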


Your question is exactly the solution to my question. :upside_down_face: :rofl: