You would not only change the loss scale, but also the gradients:
# setup
import torch
import torch.nn as nn
model = nn.Linear(10, 10)
x = torch.randn(10, 10)
y = torch.randn(10, 10)
# mean
criterion = nn.MSELoss(reduction='mean')
out = model(x)
loss = criterion(out, y)
loss.backward()
print(model.weight.grad.abs().sum())
> tensor(5.6143)
# sum
model.zero_grad()
criterion = nn.MSELoss(reduction='sum')
out = model(x)
loss = criterion(out, y)
loss.backward()
print(model.weight.grad.abs().sum())
> tensor(561.4255)
I think the disadvantage of using the sum reduction would also be that the scale of the loss (and of the gradients) depends on the batch size: in the example above the gradients are larger by a factor of the number of elements (10 * 10 = 100) compared to the mean reduction. You would therefore probably need to adjust the learning rate based on the batch size. While this is surely possible, a mean reduction would not make it necessary.
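As a rough sketch of such an adjustment (base_lr and batch_size are hypothetical names here, and plain SGD is assumed), you could divide the learning rate by the batch size so that the update magnitude no longer grows with it:

# sketch: compensating the learning rate for a sum reduction
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
criterion = nn.MSELoss(reduction='sum')

base_lr = 1e-2   # hypothetical base learning rate
batch_size = 10
# dividing by the batch size removes the batch-size dependence of the update magnitude
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr / batch_size)

x = torch.randn(batch_size, 10)
y = torch.randn(batch_size, 10)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()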
On the other hand, the none reduction gives you the flexibility to apply any custom operation to the unreduced loss; you would then either have to reduce it manually or provide a gradient of the right shape when calling backward on the unreduced loss.
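For example, a minimal sketch of per-sample weighting with reduction='none' (sample_weights is just a made-up example of such a custom operation) could show both approaches, the manual reduction and the explicit gradient passed to backward:

# sketch: custom weighting on the unreduced loss
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
criterion = nn.MSELoss(reduction='none')

x = torch.randn(10, 10)
y = torch.randn(10, 10)
sample_weights = torch.rand(10, 1)  # hypothetical per-sample weights

loss = criterion(model(x), y)       # unreduced, shape [10, 10]
weighted = loss * sample_weights    # custom operation on the unreduced loss

# option 1: reduce manually before calling backward
weighted.mean().backward()

# option 2: keep the loss unreduced and pass a gradient of matching shape
model.zero_grad()
weighted = criterion(model(x), y) * sample_weights
weighted.backward(gradient=torch.ones_like(weighted))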