I am trying to use nn.NLLLoss, so the code is:
loss = nn.NLLLoss()
loss_by_torch = loss(predictions_logp, actual_tokens)
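To make the setup concrete, here is a minimal self-contained sketch of this call on made-up data (the shapes and values are hypothetical, chosen only to mirror the real tensors):

```python
import torch
import torch.nn as nn

# Hypothetical small batch: 4 samples, 5 classes.
# nn.NLLLoss expects log-probabilities, so apply log_softmax first.
torch.manual_seed(0)
logits = torch.randn(4, 5)
predictions_logp = torch.log_softmax(logits, dim=1)
actual_tokens = torch.tensor([1, 0, 4, 2])

loss = nn.NLLLoss()  # reduction='mean' by default
loss_by_torch = loss(predictions_logp, actual_tokens)
print(loss_by_torch.item())
```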
There is another method to compute it:
loss_by_gather = -torch.mean(torch.gather(predictions_logp, dim=1, index=actual_tokens[:,None]))
And another one by using a function like this:
def compute_NLLLoss(logs, targets):
    out = torch.zeros_like(targets, dtype=torch.float)
    for i in range(len(targets)):
        out[i] = logs[i][targets[i]]
    return -torch.mean(out)

loss_by_func = compute_NLLLoss(predictions_logp, actual_tokens)
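For reference, on a small synthetic batch all three computations agree to within floating-point tolerance. The snippet below restates the function so it runs on its own; the data is made up:

```python
import torch
import torch.nn as nn

def compute_NLLLoss(logs, targets):
    # Explicit-loop version: pick out the log-probability of each target class.
    out = torch.zeros_like(targets, dtype=torch.float)
    for i in range(len(targets)):
        out[i] = logs[i][targets[i]]
    return -torch.mean(out)

torch.manual_seed(0)
predictions_logp = torch.log_softmax(torch.randn(8, 5), dim=1)
actual_tokens = torch.randint(0, 5, (8,))

loss_by_torch = nn.NLLLoss()(predictions_logp, actual_tokens)
loss_by_gather = -torch.mean(torch.gather(predictions_logp, dim=1, index=actual_tokens[:, None]))
loss_by_func = compute_NLLLoss(predictions_logp, actual_tokens)

# On a batch this small, all three typically match to float32 precision.
print(loss_by_torch.item(), loss_by_gather.item(), loss_by_func.item())
```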
So, loss_by_gather and loss_by_func are exactly equal, while loss_by_torch differs.
In one minibatch the shape of predictions_logp is [87312, 85], and the difference between loss_by_gather (or loss_by_func) and loss_by_torch varies from -0.001 to 0.001.
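One plausible (not confirmed) source of a discrepancy of that size is float32 rounding: with 87312 terms, reducing them in a different order gives slightly different sums. A sketch of this effect on made-up data:

```python
import torch

torch.manual_seed(0)
x = torch.randn(87312)  # same length as the batch dimension above

mean_forward = x.mean()
# Same values reduced in a different order: reversed, then summed in chunks.
mean_chunked = torch.stack([c.sum() for c in x.flip(0).split(1000)]).sum() / x.numel()

# The two means differ by a tiny rounding-level amount (often ~1e-7 to 1e-6).
print((mean_forward - mean_chunked).abs().item())
```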
Does the PyTorch implementation use a different formula?