I noticed that a lot of NMT implementations (including OpenNMT, Annotated Transformer, Attention-is-All-You-Need-Pytorch, …) do not normalize the loss by the number of tokens (or by the batch size).
Is there some specific reason for this?
Line of code from the Attention-is-All-You-Need-Pytorch repository:
Note: the line has an “average later” comment, but the averaging happens only after loss.backward(), so the loss used for training and the logged loss are quite different.
Why would we want to normalize the loss by the number of tokens? For logging purposes it may be a good idea to view or save the average loss, but for the backward pass you can simply use the unnormalized loss.
If we are missing something, please share a side-by-side comparison of the two scenarios (normalized and unnormalized) to make the question clearer.
Probably because it is better not to have a higher loss on longer sentences just because they are long. The Stanford CS224n lecture also gives the NMT objective as the mean cross-entropy.
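Also, averaging over the batch dimension is needed to keep the learning rate batch-invariant: with a summed loss the gradient grows with the batch size, so the same learning rate takes larger steps on larger batches.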
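To make the scaling concrete, here is a minimal pure-Python sketch. It is not taken from any of the repositories mentioned above; the function name `loss_and_grad`, the toy logits, and `PAD_ID = 0` are all made up for illustration. It computes the cross-entropy over the non-padding tokens and its gradient with respect to a shared output bias, once summed and once averaged over tokens.

```python
import math

PAD_ID = 0  # hypothetical padding index, chosen for this toy example


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def loss_and_grad(logits_per_token, targets, pad_id, normalize):
    """Cross-entropy over non-pad tokens and its gradient w.r.t. a
    bias vector that is added to every token's logits.

    If `normalize` is True, both are divided by the number of non-pad tokens.
    """
    vocab = len(logits_per_token[0])
    total = 0.0
    grad = [0.0] * vocab
    n_tokens = 0
    for logits, t in zip(logits_per_token, targets):
        if t == pad_id:
            continue  # padding positions contribute nothing
        p = softmax(logits)
        total += -math.log(p[t])
        # d(-log p[t]) / d(bias[k]) = p[k] - 1[k == t]
        for k in range(vocab):
            grad[k] += p[k] - (1.0 if k == t else 0.0)
        n_tokens += 1
    if normalize:
        total /= n_tokens
        grad = [g / n_tokens for g in grad]
    return total, grad, n_tokens


# Toy batch: four positions, vocabulary of three, last position is padding.
LOGITS = [[0.0, 1.0, 2.0], [0.5, 0.5, 0.5], [2.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
TARGETS = [2, 1, 1, PAD_ID]

sum_loss, sum_grad, n = loss_and_grad(LOGITS, TARGETS, PAD_ID, normalize=False)
mean_loss, mean_grad, _ = loss_and_grad(LOGITS, TARGETS, PAD_ID, normalize=True)

print("non-pad tokens:", n)
print("summed loss:", sum_loss, "mean loss:", mean_loss)
print("summed grad:", sum_grad)
print("mean grad:  ", mean_grad)
```

The two gradients point in the same direction and differ only by the factor `n`, so with plain SGD a summed loss behaves like a mean loss whose learning rate is multiplied by the number of tokens in the batch. Adaptive optimizers such as Adam rescale gradients, so the practical difference there may be smaller.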
I talked to some friends who did machine translation, and it seems that this choice (not averaging) is pretty arbitrary. However, it tends to give slightly better results for no obvious reason.