Hello all,
I need to compute cross-entropy loss on a vocabulary prediction, and I noticed that the result of the approach suggested by ptrblck here does not match my own implementation, which uses a simple and straightforward trick: expanding the targets along a last dimension of size VOCAB_SIZE (i.e. one-hot encoding). Here is the code:
```python
def training_step(self, batch, batch_idx):
    img, question, answer = batch
    y_hat = F.softmax(self(img, question, answer), dim=-1)

    # Suggested approach (probably right):
    # y_hat.permute(0, 2, 1).shape --> torch.Size([8, 50265, 5]) --> bs, vocab_size, seq_len
    # answer.shape --> torch.Size([8, 5]) --> bs, seq_len
    loss_suggested_by_ptrblck = self.loss(y_hat.permute(0, 2, 1), answer)
    # --> tensor(10.8249, grad_fn=<NllLoss2DBackward0>)

    # My approach: one-hot encode the targets to match y_hat's shape
    answer_logits = torch.zeros_like(y_hat)
    answer_logits = answer_logits.view(answer_logits.size(0) * answer_logits.size(1), -1)
    answer_logits[torch.arange(answer_logits.size(0)), answer.flatten()] = 1.0
    answer_logits = answer_logits.view(y_hat.size(0), -1, answer_logits.size(-1))
    assert torch.equal(torch.argmax(answer_logits, dim=-1), answer)

    # y_hat.shape --> torch.Size([8, 5, 50265]) --> bs, seq_len, vocab_size
    # answer_logits.shape --> torch.Size([8, 5, 50265]) --> bs, seq_len, vocab_size
    loss = self.loss(y_hat, answer_logits)
    # --> tensor(0.0007, grad_fn=<DivBackward1>)

    return loss
```
As you can see, my loss is ~0.0007 while ptrblck's is ~10.8.
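For reference, here is a minimal standalone check I put together (toy shapes, assuming `self.loss` is `nn.CrossEntropyLoss` and that it receives raw logits). With the class dimension placed in dim 1 as the docs require, class-index targets and one-hot probability targets give the same value:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bs, seq_len, vocab = 2, 3, 7
logits = torch.randn(bs, seq_len, vocab)         # raw logits, no softmax applied
target = torch.randint(0, vocab, (bs, seq_len))  # class indices, shape (bs, seq_len)

loss_fn = nn.CrossEntropyLoss()

# Class-index targets: input must be (N, C, d1), so permute vocab into dim 1
loss_idx = loss_fn(logits.permute(0, 2, 1), target)

# Probability (one-hot) targets: target must have the SAME layout as the input,
# i.e. the class dimension must also sit in dim 1
one_hot = nn.functional.one_hot(target, vocab).float()
loss_oh = loss_fn(logits.permute(0, 2, 1), one_hot.permute(0, 2, 1))

print(loss_idx, loss_oh)  # both print the same value
```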
How is this possible?
Thanks a lot in advance