Loss reduction sum vs mean: when to use each?

I’m rather new to pytorch (and NN architecture in general). While experimenting with my model I see that the various Loss classes for pytorch will accept a reduction parameter (none | sum | mean) for example. The differences are rather obvious regarding what will be returned, but I’m curious when it would be useful to use sum as opposed to mean? Does it have an effect on the backprop during training? Or am I really only choosing between a large loss value or smaller (average) loss value for aesthetic reasons for human readability?


You would not only change the loss scale, but also the gradients:

# setup
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
x = torch.randn(10, 10)
y = torch.randn(10, 10)

# mean
criterion = nn.MSELoss(reduction='mean')
out = model(x)
loss = criterion(out, y)
> tensor(5.6143)

# sum
criterion = nn.MSELoss(reduction='sum')
out = model(x)
loss = criterion(out, y)
> tensor(561.4255)

I think the disadvantage of using the sum reduction would also be that the loss scale (and gradients) depend on the batch size, so you would probably need to change the learning rate based on the batch size. While this is surely possible, a mean reduction would not make this necessary.
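To make the gradient claim concrete, here is a minimal sketch that compares the weight gradients for both reductions. The seed, the helper `weight_grad`, and the names `grad_mean` / `grad_sum` are my own additions for illustration. For `nn.MSELoss`, the gradient from `sum` should be the gradient from `mean` scaled by the number of output elements.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

model = nn.Linear(10, 10)
x = torch.randn(10, 10)
y = torch.randn(10, 10)

def weight_grad(reduction):
    """Return the weight gradient of the MSE loss for the given reduction."""
    model.zero_grad()
    loss = nn.MSELoss(reduction=reduction)(model(x), y)
    loss.backward()
    return model.weight.grad.clone()

grad_mean = weight_grad('mean')
grad_sum = weight_grad('sum')

# 'mean' divides by every element of the output, here 10 * 10 = 100
n_elements = y.numel()

# Check that the two gradients differ only by that constant factor
print(torch.allclose(grad_sum, grad_mean * n_elements, rtol=1e-4, atol=1e-5))
```

Since the factor is `y.numel()`, it grows with the batch size (and with the output size), which is why a learning rate tuned for `sum` would not carry over to a different batch size.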

On the other hand, the none reduction gives you the flexibility to add any custom operations to the unreduced loss and you would either have to reduce it manually or provide the gradients in the right shape when calling backward on the unreduced loss.
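As a rough sketch of both options with the none reduction: the per-sample weights below are purely hypothetical, and the variable names are mine. The first variant reduces the weighted loss manually, the second passes a gradient of the same shape as the unreduced loss to `backward`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

model = nn.Linear(10, 10)
x = torch.randn(10, 10)
y = torch.randn(10, 10)
criterion = nn.MSELoss(reduction='none')

# Hypothetical per-sample weights, shape [10, 1] so they broadcast over features
sample_weights = torch.linspace(0.1, 1.0, steps=10).unsqueeze(1)

# Option 1: apply the custom operation, then reduce manually to a scalar
loss = criterion(model(x), y)  # unreduced, shape [10, 10]
model.zero_grad()
(loss * sample_weights).mean().backward()
grad_manual = model.weight.grad.clone()

# Option 2: call backward on the unreduced loss and pass a gradient
# that has the same shape as the loss
loss = criterion(model(x), y)
model.zero_grad()
loss.backward(gradient=sample_weights.expand_as(loss) / loss.numel())
grad_explicit = model.weight.grad.clone()

# Both variants should produce the same parameter gradients
print(torch.allclose(grad_manual, grad_explicit, atol=1e-6))
```

The second variant works because the gradient of a weighted mean with respect to each loss element is just its weight divided by the number of elements.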


Thank you @ptrblck

At the risk of exposing myself as a complete novice. Could you provide a layman’s understanding of what the impact of that large gradient has on the training process? Do I understand that this will essentially propagate a greater value back when adjusting the weights? And could this be balanced out with a much smaller learning rate? If my batch sizes are always the same will there be a noticeable impact on the ability to converge, because I am perhaps taking too large of a step when adjusting the weight values?

This little tidbit has really helped expose a slightly better understanding of the training process, so thank you very much for shining some light.

Yes and yes :slight_smile: You could certainly reduce these large gradients by adjusting the learning rate, but note that the learning rate would then depend on the batch size. I.e. if you decide to use a smaller or larger batch size you would also have to change the learning rate.
The first figure in e.g. this blog post shows the effects of a large weight update on a simple use case.

If you use the same script and adjust the learning rate, I would assume that the model should be able to converge to the same final result as with the mean reduction.
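A small sketch of this idea, assuming plain SGD without momentum or weight decay (the learning rate, seed, and step count are arbitrary choices of mine): scaling the learning rate down by the number of loss elements should make the sum reduction take the same steps as the mean reduction.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

x = torch.randn(10, 10)
y = torch.randn(10, 10)

model_mean = nn.Linear(10, 10)
model_sum = copy.deepcopy(model_mean)  # identical initial parameters

lr = 0.1                # arbitrary learning rate for the mean reduction
n_elements = y.numel()  # 'sum' gradients are larger by this factor

opt_mean = torch.optim.SGD(model_mean.parameters(), lr=lr)
opt_sum = torch.optim.SGD(model_sum.parameters(), lr=lr / n_elements)

for _ in range(5):
    opt_mean.zero_grad()
    nn.MSELoss(reduction='mean')(model_mean(x), y).backward()
    opt_mean.step()

    opt_sum.zero_grad()
    nn.MSELoss(reduction='sum')(model_sum(x), y).backward()
    opt_sum.step()

# Both models should end up with (numerically) the same weights
print(torch.allclose(model_mean.weight, model_sum.weight, atol=1e-5))
```

Note that this equivalence relies on the update being proportional to the gradient. Adaptive optimizers such as Adam rescale the gradient internally, so the interaction with the loss scale would look different there.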


Thank you. I really appreciate your contribution to this community.

hi Ptrblck,

I have a similar issue: when I use MSELoss(reduction='sum') and MSELoss(reduction='mean'), the two loss values after the first iteration do not differ by a factor equal to the batch size, e.g.:

criterion = torch.nn.MSELoss(reduction='sum')
tensor(40190.8242, device='cuda:0', grad_fn=<MseLossBackward>)

criterion = torch.nn.MSELoss()
tensor(0.2726, device='cuda:0', grad_fn=<MseLossBackward>)

The batch size is 9, so I would expect the loss to be either 4,465.647 or 2.4534, depending on which one is correct?

I am more suspicious of the 0.2726 value, as this looks too low for the start of training, but I don’t understand where the error comes from.


I’m not sure I understand the issue correctly, but it seems you are concerned about the correctness of the reduction in the second approach? Could you post the shapes of out and real_data_I so that we could double check the calculation?

hi ptrblk,

the shapes are [9,1,128,128].


Thanks! Based on this shape, the loss calculation seems to be correct using the reduction='mean' setting:

40190.8242 / (9 * 1 * 128 * 128)
> 0.2725614705403646

I realised that, doh. I thought the difference was based purely on batch size and not the size of the array :exploding_head:
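For anyone who runs into the same confusion, here is a quick check with random tensors of the shape from this thread (the tensors are only stand-ins, not the actual `out` and `real_data_I`). It compares dividing the summed loss by the number of elements against dividing by the batch size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

out = torch.randn(9, 1, 128, 128)     # random stand-in for the model output
target = torch.randn(9, 1, 128, 128)  # random stand-in for real_data_I

loss_sum = nn.MSELoss(reduction='sum')(out, target)
loss_mean = nn.MSELoss(reduction='mean')(out, target)

print(out.numel())  # 9 * 1 * 128 * 128 = 147456

# Dividing by the number of elements reproduces the mean reduction ...
matches_numel = torch.allclose(loss_sum / out.numel(), loss_mean, rtol=1e-4)
# ... while dividing by the batch size alone does not
matches_batch = torch.allclose(loss_sum / out.shape[0], loss_mean, rtol=1e-4)
print(matches_numel, matches_batch)
```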

thanks for taking the time out to deal with an idiot…


Ah, c’mon… we are all missing these “trivial” things all the time :wink:

too kind my friend, too kind :slight_smile:

We could remove the dependency on batch size by dividing the loss by batch size. Would that work?

criterion = nn.MSELoss(reduction='sum')
out = model(x)
loss = criterion(out, y)
loss = loss / x.shape[0]

Yes, you could surely divide by the batch size in case you don’t want to divide by the number of elements for specific reasons.

Thanks! I believe that for dense predictions (such as segmentation or depth estimation), the average value per pixel would not be too meaningful, since it will result in only a small change in value because of averaging so many pixels. It would be hard to capture subtle changes in the prediction. Calculating the value across an entire image should give a much better signal.

I don’t think the interesting difference is the actual range, as you could always increase or decrease the learning rate. The advantage of using the average of all elements would be to get a loss value, which would not depend on the shape (i.e. using a larger or smaller spatial size would yield approx. the same loss values assuming your model is flexible regarding the spatial shape).
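A rough illustration of the shape argument, using random tensors as a stand-in for predictions with the same per-pixel error statistics at two spatial sizes (the sizes 32 and 128, the batch size 4, and the helper `losses` are my own choices): the mean loss should stay roughly constant, while the per-sample sum should grow with the number of pixels.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

def losses(size):
    """Return (mean loss, sum loss divided by batch size) for a given spatial size."""
    pred = torch.randn(4, 1, size, size)
    target = torch.randn(4, 1, size, size)
    mean_loss = F.mse_loss(pred, target, reduction='mean')
    per_sample_sum = F.mse_loss(pred, target, reduction='sum') / pred.shape[0]
    return mean_loss.item(), per_sample_sum.item()

mean_small, sum_small = losses(32)
mean_large, sum_large = losses(128)

# The mean loss is roughly the same for both spatial sizes
print(mean_small, mean_large)
# The per-sample sum scales with the pixel count, (128 * 128) / (32 * 32) = 16
print(sum_large / sum_small)
```

So with the per-sample sum, a learning rate tuned at one resolution would likely need retuning at another, whereas the mean keeps the loss scale comparable.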

Both are equivalent if the dimensions are constant, because you can compensate for the scale with the learning rate.
But if the output dimensions are variable, neither approach is appropriate.
Instead, you need to do this:

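import torch
import torch.nn.functional as F

loss = F.mse_loss(y_pred, y_true, reduction='none')
# divide by the square root of the number of elements, then sum
loss = (loss / torch.as_tensor(loss.size(), dtype=torch.float).prod().sqrt()).sum()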

Hi, does the reduction='mean' normalization change between different loss functions?
For example, I read here that for cross entropy loss the normalization factor between the sum and mean reductions is not determined by the input size, but only by the number of elements in the batch, N (if we set the parameter weight=None).
When the description says:

and N spans the minibatch dimension

Does it mean the number of images in a batch, or the number of images multiplied by their spatial dimensions?

You’ve copied only part of the description as it mentions:

and N spans the minibatch dimension as well as d_1, ..., d_k for the K-dimensional case

which means that the mean reduction divides by the total number of target elements, not just by the batch size. A quick test also shows this:

import torch
import torch.nn as nn

criterion_mean = nn.CrossEntropyLoss(reduction='mean')
criterion_sum = nn.CrossEntropyLoss(reduction='sum')

output = torch.randn(2, 3, 224, 224)
target = torch.randint(0, 3, (2, 224, 224))

loss_mean = criterion_mean(output, target)
loss_sum = criterion_sum(output, target)

print(loss_mean - (loss_sum / target.nelement()))
# > tensor(0.)