Loss reduction sum vs mean: when to use each?

pumplerod · March 23, 2021, 4:06am

I’m rather new to PyTorch (and NN architecture in general). While experimenting with my model I see that the various loss classes in PyTorch accept a reduction parameter (none | sum | mean). The differences are rather obvious regarding what will be returned, but I’m curious when it would be useful to use sum as opposed to mean. Does it have an effect on the backprop during training? Or am I really only choosing between a large loss value and a smaller (average) loss value for human readability?

ptrblck · March 23, 2021, 5:19am

You would not only change the loss scale, but also the gradients:

# setup
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
x = torch.randn(10, 10)
y = torch.randn(10, 10)

# mean
criterion = nn.MSELoss(reduction='mean')
out = model(x)
loss = criterion(out, y)
loss.backward()
print(model.weight.grad.abs().sum())
> tensor(5.6143)

# sum
model.zero_grad()
criterion = nn.MSELoss(reduction='sum')
out = model(x)
loss = criterion(out, y)
loss.backward()
print(model.weight.grad.abs().sum())
> tensor(561.4255)

I think the disadvantage of using the sum reduction would also be that the loss scale (and gradients) depend on the batch size, so you would probably need to change the learning rate based on the batch size. While this is surely possible, a mean reduction would not make this necessary.
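To make the batch-size dependence concrete, here is a minimal sketch. It assumes a toy nn.Linear model and doubles the batch by repeating the same samples, so the comparison is exact rather than statistical; the helper name grad_norm is made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 10)
x, y = torch.randn(10, 10), torch.randn(10, 10)

def grad_norm(reduction, inputs, targets):
    """Return the summed absolute weight gradient for one backward pass."""
    model.zero_grad()
    loss = nn.MSELoss(reduction=reduction)(model(inputs), targets)
    loss.backward()
    return model.weight.grad.abs().sum().item()

# Double the batch by repeating the same samples
x2, y2 = torch.cat([x, x]), torch.cat([y, y])

sum_small, sum_large = grad_norm('sum', x, y), grad_norm('sum', x2, y2)
mean_small, mean_large = grad_norm('mean', x, y), grad_norm('mean', x2, y2)

print(sum_large / sum_small)    # close to 2: the gradient grows with the batch
print(mean_large / mean_small)  # close to 1: the gradient is unchanged
```

With real data the larger batch would contain new samples, so the ratios would only hold approximately.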

On the other hand, the none reduction gives you the flexibility to add any custom operations to the unreduced loss and you would either have to reduce it manually or provide the gradients in the right shape when calling backward on the unreduced loss.
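As a rough illustration of the two options with reduction='none', the sketch below uses hypothetical per-sample weights as the "custom operation". The weights and the toy model are assumptions for the example, not anything from the posts above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 10)
x, y = torch.randn(4, 10), torch.randn(4, 10)
criterion = nn.MSELoss(reduction='none')
sample_weights = torch.tensor([1.0, 0.5, 0.5, 2.0])  # hypothetical weights

# Option 1: apply a custom operation, then reduce manually
unreduced = criterion(model(x), y)                    # shape [4, 10]
loss = (unreduced.mean(dim=1) * sample_weights).mean()
loss.backward()
grad_manual = model.weight.grad.clone()

# Option 2: call backward on the unreduced loss with a gradient of the same shape
model.zero_grad()
unreduced = criterion(model(x), y)
# d(loss) / d(unreduced[i, j]) = sample_weights[i] / (4 * 10)
upstream = torch.ones_like(unreduced) * sample_weights.unsqueeze(1) / unreduced.numel()
unreduced.backward(gradient=upstream)
grad_explicit = model.weight.grad.clone()

# Both routes should produce the same weight gradient
print(torch.allclose(grad_manual, grad_explicit, atol=1e-6))
```

Option 1 is usually the easier one to read; option 2 is mainly useful when the upstream gradient comes from somewhere else.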

pumplerod · March 23, 2021, 6:51am

Thank you @ptrblck

At the risk of exposing myself as a complete novice, could you provide a layman’s explanation of the impact that large gradient has on the training process? Do I understand correctly that this will essentially propagate a greater value back when adjusting the weights? And could this be balanced out with a much smaller learning rate? If my batch sizes are always the same, will there be a noticeable impact on the ability to converge, because I am perhaps taking too large a step when adjusting the weight values?

This little tidbit has really helped expose a slightly better understanding of the training process, so thank you very much for shining some light.

ptrblck · March 23, 2021, 8:00am

Yes and yes. You could certainly reduce these large gradients by adjusting the learning rate, but note that the learning rate would then depend on the batch size. I.e. if you decide to use a smaller or larger batch size you would also have to change the learning rate.
The first figure in e.g. this blog post shows the effects of a large weight update on a simple use case.

If you use the same script and adjust the learning rate, I would assume that you should be able to let the model converge to the same final result as with the mean reduction.
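A small sketch of that learning-rate adjustment, assuming plain SGD without momentum or weight decay and a toy nn.Linear model (both are assumptions for the example). Adaptive optimizers rescale gradients themselves, so the relationship would not carry over to them in this simple form.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model_mean = nn.Linear(10, 10)
model_sum = copy.deepcopy(model_mean)   # identical starting weights
x, y = torch.randn(10, 10), torch.randn(10, 10)

lr = 0.1
n_elements = y.numel()                  # 100 elements in this toy setup

opt_mean = torch.optim.SGD(model_mean.parameters(), lr=lr)
# Shrink the learning rate by the factor by which the sum reduction inflates the gradient
opt_sum = torch.optim.SGD(model_sum.parameters(), lr=lr / n_elements)

for _ in range(5):
    for model, opt, reduction in ((model_mean, opt_mean, 'mean'),
                                  (model_sum, opt_sum, 'sum')):
        opt.zero_grad()
        loss = nn.MSELoss(reduction=reduction)(model(x), y)
        loss.backward()
        opt.step()

# Largest difference between the two weight matrices after the updates
max_diff = (model_mean.weight - model_sum.weight).abs().max().item()
print(max_diff)
```

The two models should end up with (numerically) the same weights, which is why the choice mostly comes down to how the learning rate has to be tuned.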

pumplerod · March 24, 2021, 1:59am

Thank you. I really appreciate your contribution to this community.

chaslie · September 7, 2021, 8:54am

hi Ptrblck,

I have a similar issue: when I use MSELoss(reduction='sum') and MSELoss(reduction='mean'), the two calculated losses do not differ by a factor of the batch size after the first iteration, e.g.:

criterion = torch.nn.MSELoss(reduction='sum')
criterion(out,real_data_I)
tensor(40190.8242, device='cuda:0', grad_fn=<MseLossBackward>)

criterion = torch.nn.MSELoss()
criterion(out,real_data_I)
tensor(0.2726, device='cuda:0', grad_fn=<MseLossBackward>)

The batch size is 9, so I would expect the loss to be either 4,465.647 or 2.4534, depending on which is correct.

I am more suspicious of the 0.2726 value, as it looks too low for the start of training, but I don’t understand where the error comes from.

chaslie

ptrblck · September 7, 2021, 8:56am

I’m not sure I understand the issue correctly, but it seems you are concerned about the correctness of the reduction in the second approach? Could you post the shapes of out and real_data_I so that we could double check the calculation?

chaslie · September 7, 2021, 8:59am

hi ptrblck,

the shapes are [9,1,128,128].

chaslie

ptrblck · September 7, 2021, 9:13am

Thanks! Based on this shape, the loss calculation seems to be correct using the reduction='mean' setting:

40190.8242 / (9 * 1 * 128 * 128)
> 0.2725614705403646
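The same check can be sketched in code. The tensors below are random stand-ins with the reported shape (the real out and real_data_I are not available here), so the loss values will differ from the ones posted, but the relationship between the reductions is the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Random stand-ins with the shape reported above
out = torch.randn(9, 1, 128, 128)
real_data_I = torch.randn(9, 1, 128, 128)

loss_sum = nn.MSELoss(reduction='sum')(out, real_data_I)
loss_mean = nn.MSELoss(reduction='mean')(out, real_data_I)

print(out.numel())                        # 147456 = 9 * 1 * 128 * 128
print(loss_sum / out.numel(), loss_mean)  # the two values should agree
print(loss_sum / out.size(0))             # per-sample sum, a different quantity
```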

chaslie · September 7, 2021, 9:19am

I realised that, doh. I thought the difference was based purely on batch size and not the size of the array

chaslie · September 7, 2021, 9:19am

thanks for taking the time out to deal with an idiot…

ptrblck · September 7, 2021, 9:22am

Ah, c’mon… we are all missing these “trivial” things all the time

chaslie · September 7, 2021, 9:27am

too kind my friend, too kind

Shreeyak · December 7, 2021, 6:41pm

We could remove the dependency on batch size by dividing the loss by batch size. Would that work?

criterion = nn.MSELoss(reduction='sum')
out = model(x)
loss = criterion(out, y)
loss = loss / x.shape[0]

ptrblck · December 7, 2021, 7:54pm

Yes, you could surely divide by the batch size in case you don’t want to divide by the number of elements for specific reasons.
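For reference, a minimal sketch of what dividing the summed loss by the batch size computes, assuming a toy nn.Linear model: it is the per-sample summed error averaged over the batch, which is larger than the default mean by the number of elements per sample.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 10)
x, y = torch.randn(8, 10), torch.randn(8, 10)
out = model(x)

# Sum over everything, then divide by the batch size
loss_a = nn.MSELoss(reduction='sum')(out, y) / x.shape[0]

# Same quantity written out: per-sample sums, averaged over the batch
per_element = nn.MSELoss(reduction='none')(out, y)   # shape [8, 10]
loss_b = per_element.flatten(start_dim=1).sum(dim=1).mean()

# Relation to the default: larger by the number of elements per sample
loss_mean = nn.MSELoss(reduction='mean')(out, y)
loss_c = loss_mean * y[0].numel()

print(loss_a.item(), loss_b.item(), loss_c.item())   # all three should agree
```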

Shreeyak · December 10, 2021, 9:46am

Thanks! I believe that for dense predictions (such as segmentation or depth estimation), the average value per pixel would not be too meaningful, since it will result in only a small change in value because of averaging so many pixels. It would be hard to capture subtle changes in the prediction. Calculating the value across an entire image should give a much better signal.

ptrblck · December 10, 2021, 9:52am

I don’t think the interesting difference is the actual range, as you could always increase or decrease the learning rate. The advantage of using the average of all elements would be to get a loss value, which would not depend on the shape (i.e. using a larger or smaller spatial size would yield approx. the same loss values assuming your model is flexible regarding the spatial shape).
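A quick way to see the shape argument, using random stand-in tensors instead of real model outputs (an assumption for the example; the helper name losses is made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def losses(size):
    """MSE between random stand-in tensors of shape [2, 1, size, size]."""
    out = torch.randn(2, 1, size, size)
    target = torch.randn(2, 1, size, size)
    mean = nn.MSELoss(reduction='mean')(out, target).item()
    total = nn.MSELoss(reduction='sum')(out, target).item()
    return mean, total

mean_64, sum_64 = losses(64)
mean_128, sum_128 = losses(128)

print(mean_128 / mean_64)  # roughly 1: comparable across resolutions
print(sum_128 / sum_64)    # roughly 4: grows with the number of pixels
```

The ratios are only approximate because each resolution draws different random values.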

Raccoon_2 · January 26, 2022, 9:59am

Both are equivalent if the dimensions are constant, because you can change the learning rate.
But if the output dimensions are variable, neither approach is appropriate.
Instead, you need to do this:

import torch.nn.functional as F
loss = F.mse_loss(y_pred, y_true, reduction='none')
# divide the summed loss by the square root of the number of elements
loss = loss.sum() / loss.numel() ** 0.5

Torcione · March 16, 2022, 1:35pm

Hi, does the reduction = 'mean' normalization change between different loss functions?
For example, I read here that for cross entropy loss the difference in normalization between the sum and mean reductions is not determined by the input size but just by the number of elements in the batch N (if we set the parameter weight=None).
When the description says:

and N spans the minibatch dimension

Does it mean the number of images in a batch or the number of images multiplied by their dimensions?

ptrblck · March 16, 2022, 4:17pm

You’ve copied only part of the description as it mentions:

and N spans the minibatch dimension as well as d1,…,dk for the K-dimensional case

which means that the mean reduction divides by the number of elements in the target. A quick test also shows this:

import torch
import torch.nn as nn

criterion_mean = nn.CrossEntropyLoss(reduction='mean')
criterion_sum = nn.CrossEntropyLoss(reduction='sum')

output = torch.randn(2, 3, 224, 224)
target = torch.randint(0, 3, (2, 224, 224))

loss_mean = criterion_mean(output, target)
loss_sum = criterion_sum(output, target)

print(loss_mean - (loss_sum / target.nelement()))
# > tensor(0.)
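The test above uses weight=None. When a weight tensor is passed, the nn.CrossEntropyLoss documentation describes the mean as a weighted average, so the divisor is the summed weight of the target classes rather than the element count. A minimal sketch with hypothetical class weights:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
class_weights = torch.tensor([1.0, 2.0, 0.5])  # hypothetical class weights

criterion_mean = nn.CrossEntropyLoss(weight=class_weights, reduction='mean')
criterion_sum = nn.CrossEntropyLoss(weight=class_weights, reduction='sum')

output = torch.randn(2, 3, 32, 32)
target = torch.randint(0, 3, (2, 32, 32))

loss_mean = criterion_mean(output, target)
loss_sum = criterion_sum(output, target)

# With class weights, the divisor is the summed weight of the target classes
weight_total = class_weights[target].sum()
print(loss_mean, loss_sum / weight_total)   # should agree
print(loss_sum / target.nelement())         # generally a different value
```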