Why is the NMT loss not normalized by the number of tokens?

Guitaricet · September 22, 2019, 12:55am

I noticed that a lot of NMT implementations (including OpenNMT, Annotated Transformer, Attention-is-All-You-Need-Pytorch, …) do not normalize the loss by the number of tokens (or by the batch size).
Is there some specific reason for this?

Lines of code from the Attention-is-All-You-Need-Pytorch repository:

jadore801120/attention-is-all-you-need-pytorch/blob/20f355eb655bad40195ae302b9d8036716be9a23/train.py#L50


        n_class = pred.size(1)

        one_hot = torch.zeros_like(pred).scatter(1, gold.view(-1, 1), 1)
        one_hot = one_hot * (1 - eps) + (1 - one_hot) * eps / (n_class - 1)
        log_prb = F.log_softmax(pred, dim=1)

        non_pad_mask = gold.ne(Constants.PAD)
        loss = -(one_hot * log_prb).sum(dim=1)
        loss = loss.masked_select(non_pad_mask).sum()  # average later
    else:
        loss = F.cross_entropy(pred, gold, ignore_index=Constants.PAD, reduction='sum')

    return loss


def train_epoch(model, training_data, optimizer, device, smoothing):
    ''' Epoch operation in training phase'''

    model.train()

    total_loss = 0

Note: the snippet carries the comment “average later”, but the averaging happens only after loss.backward(), so the loss that is backpropagated (the sum) and the loss that is logged (the average) are quite different.
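To make the difference concrete, here is a minimal sketch (not taken from either repository) that compares the gradient of the summed loss with the gradient of the token-averaged loss. The tensor shapes, the random logits, and `PAD = 0` are assumptions made up for illustration; only `F.cross_entropy` with `ignore_index` and `reduction` mirrors the call in the snippet above.

```python
import torch
import torch.nn.functional as F

PAD = 0  # hypothetical padding index, standing in for Constants.PAD

torch.manual_seed(0)
pred = torch.randn(6, 10, requires_grad=True)   # (tokens, vocab) logits, made up
gold = torch.tensor([4, 7, 1, 9, PAD, PAD])     # last two positions are padding

n_tokens = gold.ne(PAD).sum().item()            # number of non-pad tokens

# Loss as used for backward() in the snippet above: summed over tokens.
loss_sum = F.cross_entropy(pred, gold, ignore_index=PAD, reduction='sum')
grad_sum, = torch.autograd.grad(loss_sum, pred)

# Token-normalized alternative: mean over the non-pad tokens.
loss_mean = F.cross_entropy(pred, gold, ignore_index=PAD, reduction='mean')
grad_mean, = torch.autograd.grad(loss_mean, pred)

# "Average later": dividing the summed loss afterwards only changes the
# number that gets logged, not the gradient that was already computed.
logged = loss_sum.item() / n_tokens

print(n_tokens)
print(torch.allclose(grad_sum, grad_mean * n_tokens, atol=1e-6))
```

In this sketch the two gradients point in the same direction and differ only by the factor `n_tokens`, so for a fixed number of tokens the choice amounts to a rescaling of the learning rate for plain SGD.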

Lines of code from the OpenNMT-py repository:

OpenNMT/OpenNMT-py/blob/8cd68bcff849b2a4676bde720de6f80a239bfafe/onmt/utils/loss.py#L45


        len(tgt_field.vocab), opt.copy_attn_force,
        unk_index=unk_idx, ignore_index=padding_idx
    )
elif opt.label_smoothing > 0 and train:
    criterion = LabelSmoothingLoss(
        opt.label_smoothing, len(tgt_field.vocab), ignore_index=padding_idx
    )
elif isinstance(model.generator[-1], LogSparsemax):
    criterion = SparsemaxLoss(ignore_index=padding_idx, reduction='sum')
else:
    criterion = nn.NLLLoss(ignore_index=padding_idx, reduction='sum')


# if the loss function operates on vectors of raw logits instead of
# probabilities, only the first part of the generator needs to be
# passed to the NMTLossCompute. At the moment, the only supported
# loss function of this kind is the sparsemax loss.
use_raw_logits = isinstance(criterion, SparsemaxLoss)
loss_gen = model.generator[0] if use_raw_logits else model.generator
if opt.copy_attn:
    compute = onmt.modules.CopyGeneratorLossCompute(
        criterion, loss_gen, tgt_field.vocab, opt.copy_loss_by_seqlength,

Lines of code from the Annotated Transformer:
http://nlp.seas.harvard.edu/2018/04/03/attention.html#label-smoothing

Abhilash_Srivastava · September 22, 2019, 6:35pm

Why do we want to normalize the loss by the number of tokens? For logging purposes, it might be a good idea to view/save the average loss, but for the backward pass you can simply use the unnormalized loss.
If we are missing something, please share a side-by-side comparison of the two scenarios (normalized and unnormalized) to make the question clearer.

Guitaricet · September 24, 2019, 5:43pm

Probably because it is better not to have a higher loss on longer sentences just because they are long. Also, according to the Stanford CS224n lecture, the NMT objective is the mean cross-entropy.
In addition, averaging over the batch dimension is needed so that the effective learning rate does not depend on the batch size.
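A rough way to check the batch-size argument is to feed the same data twice and look at how the gradient norm changes. This is only a toy sketch: the projection `weight`, the feature size, the helper `weight_grad`, and `PAD = 0` are all hypothetical and not taken from any of the repositories above.

```python
import torch
import torch.nn.functional as F

PAD = 0  # hypothetical padding index

torch.manual_seed(0)
x = torch.randn(4, 8)                  # 4 target positions, 8 hidden features
weight = torch.randn(8, 7)             # toy output projection to a 7-word vocab
gold = torch.tensor([3, 1, PAD, 2])    # one padded position

def weight_grad(x, gold, reduction):
    """Gradient of the cross-entropy loss w.r.t. the toy projection weight."""
    w = weight.clone().requires_grad_(True)
    loss = F.cross_entropy(x @ w, gold, ignore_index=PAD, reduction=reduction)
    loss.backward()
    return w.grad

# The same data repeated twice, i.e. a batch with twice as many tokens.
x2, gold2 = torch.cat([x, x]), torch.cat([gold, gold])

# Ratio of gradient norms: doubled batch vs. original batch.
sum_ratio = weight_grad(x2, gold2, 'sum').norm() / weight_grad(x, gold, 'sum').norm()
mean_ratio = weight_grad(x2, gold2, 'mean').norm() / weight_grad(x, gold, 'mean').norm()

print(sum_ratio.item(), mean_ratio.item())
```

With the summed loss the gradient grows with the number of tokens in the batch, while with the mean it stays the same, so under the summed loss the learning rate would have to be retuned whenever the batch size changes (at least for plain SGD; adaptive optimizers such as Adam partly compensate for the scale).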

I talked to some friends who work on machine translation, and it seems that this choice (not averaging) is pretty arbitrary. However, it tends to give slightly better results for no obvious reason.