Loss reduction sum vs mean: when to use each?

Switching the reduction would not only change the loss scale, but also the gradients:

# setup
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
x = torch.randn(10, 10)
y = torch.randn(10, 10)

# mean
criterion = nn.MSELoss(reduction='mean')
out = model(x)
loss = criterion(out, y)
loss.backward()
print(model.weight.grad.abs().sum())
> tensor(5.6143)

# sum
model.zero_grad()
criterion = nn.MSELoss(reduction='sum')
out = model(x)
loss = criterion(out, y)
loss.backward()
print(model.weight.grad.abs().sum())
> tensor(561.4255)

I think another disadvantage of using the sum reduction is that the loss scale (and the gradients) depend on the batch size, so you would probably need to adjust the learning rate whenever the batch size changes. While this is surely possible, a mean reduction does not make it necessary.
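A minimal sketch of that adjustment (the batch size, learning rate, and division by the element count are my own illustration, not from the original post): with reduction='sum' the gradients are roughly numel() times larger than with reduction='mean', so dividing the learning rate by that count gives an approximately equivalent update.

import torch
import torch.nn as nn

model = nn.Linear(10, 10)
x = torch.randn(64, 10)
y = torch.randn(64, 10)
n_elements = y.numel()  # the sum loss scales with this count

criterion_sum = nn.MSELoss(reduction='sum')
# divide the lr by the element count to compensate for the larger gradients
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3 / n_elements)

optimizer.zero_grad()
loss = criterion_sum(model(x), y)
loss.backward()
optimizer.step()  # roughly matches reduction='mean' with lr=1e-3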

On the other hand, the reduction='none' option gives you the flexibility to apply custom operations to the unreduced loss. You would then either have to reduce it manually or pass a gradient tensor of the right shape when calling backward on the unreduced loss.
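A short sketch of that workflow (the per-element weights below are made up for illustration): compute the unreduced loss, apply a custom operation, then reduce manually before calling backward.

import torch
import torch.nn as nn

model = nn.Linear(10, 10)
x = torch.randn(10, 10)
y = torch.randn(10, 10)

criterion = nn.MSELoss(reduction='none')
unreduced = criterion(model(x), y)        # shape [10, 10], not reduced
weights = torch.rand_like(unreduced)      # hypothetical per-element weights
loss = (unreduced * weights).mean()       # manual reduction to a scalar
loss.backward()

# Alternatively, backward on the unreduced loss needs a gradient of matching shape:
# unreduced.backward(gradient=torch.ones_like(unreduced))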
