Hello all,
I need to compute cross-entropy loss on a vocabulary prediction, and I noticed that the result of the approach suggested by ptrblck here does not match my own implementation, which uses a simple and straightforward trick: expanding the targets along a last dimension of size VOCAB_SIZE (i.e. one-hot encoding). Here is the code:
```python
def training_step(self, batch, batch_idx):
    img, question, answer = batch
    y_hat = F.softmax(self(img, question, answer), dim=-1)

    # Suggested approach (probably right):
    # y_hat.permute(0, 2, 1).shape --> torch.Size([8, 50265, 5]) --> bs, vocab_size, seq_len
    # answer.shape --> torch.Size([8, 5]) --> bs, seq_len
    loss_suggested_by_ptrblck = self.loss(y_hat.permute(0, 2, 1), answer)
    # --> tensor(10.8249, grad_fn=<NllLoss2DBackward0>)

    # My approach: one-hot encode the targets to match y_hat's shape
    answer_logits = torch.zeros_like(y_hat)
    answer_logits = answer_logits.view(answer_logits.size(0) * answer_logits.size(1), -1)
    answer_logits[torch.arange(answer_logits.size(0)), answer.flatten()] = 1.0
    answer_logits = answer_logits.view(y_hat.size(0), -1, answer_logits.size(-1))
    assert torch.equal(torch.argmax(answer_logits, dim=-1), answer)

    # y_hat.shape --> torch.Size([8, 5, 50265]) --> bs, seq_len, vocab_size
    # answer_logits.shape --> torch.Size([8, 5, 50265]) --> bs, seq_len, vocab_size
    loss = self.loss(y_hat, answer_logits)
    # --> tensor(0.0007, grad_fn=<DivBackward1>)

    return loss
```
As you can see, my loss is ~0.0007 while ptrblck's is ~10.8.
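For reference, here is a minimal standalone check I put together (toy shapes, assuming `self.loss` is `nn.CrossEntropyLoss` and that it receives raw logits). With the class dimension placed in dim 1 as the docs require, class-index targets and one-hot probability targets give the same value:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bs, seq_len, vocab = 2, 3, 7
logits = torch.randn(bs, seq_len, vocab)         # raw logits, no softmax applied
target = torch.randint(0, vocab, (bs, seq_len))  # class indices, shape (bs, seq_len)

loss_fn = nn.CrossEntropyLoss()

# Class-index targets: input must be (N, C, d1), so permute vocab into dim 1
loss_idx = loss_fn(logits.permute(0, 2, 1), target)

# Probability (one-hot) targets: target must have the SAME layout as the input,
# i.e. the class dimension must also sit in dim 1
one_hot = nn.functional.one_hot(target, vocab).float()
loss_oh = loss_fn(logits.permute(0, 2, 1), one_hot.permute(0, 2, 1))

print(loss_idx, loss_oh)  # both print the same value
```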
How is this possible?
Thanks a lot in advance