Loss reduction sum vs mean: when to use each?

I think a disadvantage of using the sum reduction would also be that the loss scale (and the gradients) depends on the batch size, so you would probably need to change the learning rate based on the batch size. While this is surely possible, a mean reduction would not make it necessary.
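
To make this concrete, here is a minimal sketch (a toy linear model with random data, so the model, shapes, and numbers are just placeholders) comparing the gradient norm under both reductions for a few batch sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)

for batch_size in (1, 10, 100):
    x = torch.randn(batch_size, 4)
    y = torch.randn(batch_size, 1)

    # Gradient under reduction='sum' grows with the batch size.
    model.zero_grad()
    nn.MSELoss(reduction='sum')(model(x), y).backward()
    sum_grad = model.weight.grad.norm().item()

    # Gradient under reduction='mean' stays roughly constant.
    model.zero_grad()
    nn.MSELoss(reduction='mean')(model(x), y).backward()
    mean_grad = model.weight.grad.norm().item()

    print(f"batch_size={batch_size:4d}  "
          f"sum grad norm={sum_grad:8.3f}  "
          f"mean grad norm={mean_grad:8.3f}")
```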

@ptrblck

But if your dataset has 10 elements, then with batch size 10, 1 epoch is 1 optimizer step whose gradient is the average over your entire dataset, whereas with batch size 1 it is 10 optimizer steps, one from each element of your dataset.

With mean reduction you would need to train for 10 epochs to do the same number of similarly sized optimizer steps. You are effectively training on a dataset that is 1 / batch size times as big.
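
As a toy illustration of the equivalence (the linear model and SGD setup here are my own assumptions, not anything from the thread): a single SGD step on a sum-reduced loss matches a step on the mean-reduced loss once the learning rate is multiplied by the batch size.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, lr = 10, 0.01
x = torch.randn(batch_size, 4)
y = torch.randn(batch_size, 1)

model_sum = nn.Linear(4, 1)
model_mean = copy.deepcopy(model_sum)  # identical initial weights

opt_sum = torch.optim.SGD(model_sum.parameters(), lr=lr)
opt_mean = torch.optim.SGD(model_mean.parameters(), lr=lr * batch_size)

# One step on the sum-reduced loss.
nn.MSELoss(reduction='sum')(model_sum(x), y).backward()
opt_sum.step()

# One step on the mean-reduced loss with the scaled learning rate.
nn.MSELoss(reduction='mean')(model_mean(x), y).backward()
opt_mean.step()

# The updated parameters match (up to floating point precision).
print(torch.allclose(model_sum.weight, model_mean.weight, atol=1e-6))
```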

Using different batch sizes will not create the same training, since the parameters are updated in each step; you can search the literature describing the effect of the batch size selection on the final training metrics.
If the sum reduction works better for your use case, feel free to use it. The dependence of the learning rate scaling on the batch size described in my first post still holds.
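
As a minimal sketch of that adjustment (base_lr and batch_size are hypothetical values, not from this thread): if a learning rate was tuned for reduction='mean', dividing it by the batch size keeps the step size comparable after switching to reduction='sum'.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
criterion = nn.MSELoss(reduction='sum')

base_lr = 0.01   # learning rate originally tuned for reduction='mean'
batch_size = 32  # placeholder value

# Compensate for the batch-size-dependent gradient scale of 'sum'.
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr / batch_size)
```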