Cross Entropy Loss Math under the hood

@ptrblck could you help me?
Hi everyone!

Please, someone could explain the math under the hood of Cross Entropy Loss in PyTorch?
I was performing some tests here and result of the Cross Entropy Loss in PyTorch doesn’t match with the result using the expression below:
image
I took some examples calculated using the expression above and executed it using the Cross Entropy Loss in PyTorch and the results were not the same.

I am trying this example here using Cross Entropy Loss from PyTorch:

probs1 = torch.tensor([[
    [
      [ 0.1, 0.3],
      [0.4, 0.5]
    ],
    [
      [0.2, 0.3],
      [0.4, 0.5]
    ],
    [
      [0.7, 0.4],
      [0.2, 0.0]
    ]
  ]])

target = torch.tensor([
    [
      [2, 2],
      [0, 1]
    ]
  ])

print(torch.sum(probs1, dim=0))
print(probs1.shape)

criterion = nn.CrossEntropyLoss()
loss = criterion(probs1, target)
loss

result -> tensor(0.9488)

Each pixel along the 3 channels corresponds to a probability distribution…there is a probability distribution for each position of the tensor…and the target has the classes for each distribution.
How can I know if this loss is beign computed correctly?

Best regards,

Matheus Santos.

Hello Matheus!

The issue is that pytorch’s CrossEntropyLoss doesn’t exactly match
the conventional definition of cross-entropy that you gave above.

Rather, it expects raw-score logits as it inputs, and, in effect, applies
softmax() to the logits internally to convert them to probabilities.
(CrossEntropyLoss might better have been named
CrossEntropyWithLogitsLoss.)

To check this, you could apply the logit function, log (p / (1 - p))
to convert your probs1 tensor, and then run that through
CrossEntropyLoss.

Best.

K. Frank

1 Like

Note that you are not using nn.CrossEntropyLoss correctly, as this criterion expects logits and will apply F.log_softmax internally, while probs already contains probabilities, as @KFrank explained.

So, let’s change the criterion to nn.NLLLoss and apply the torch.log manually.
This approach is just to demonstrate the formula and shouldn’t be used, as torch.log(torch.softmax()) is less numerically stable than F.log_softmax.

Also, the default reduction for the criteria in PyTorch will calculate the average over the observations, so lets use reduction='sum'.

Given that you’ll get:

criterion = nn.NLLLoss(reduction='sum')
loss = criterion(torch.log(probs1), target)

The manual approach from your formula would correspond to:

# Manual approach using your formula
one_hot = F.one_hot(target, num_classes = 3)
one_hot = one_hot.permute(0, 3, 1, 2)
ce = (one_hot * torch.log(probs1 + 1e-7))[one_hot.bool()]
ce = -1 * ce.sum()

While the manual approach from the PyTorch docs would give you:

# Using the formula from the docs
loss_manual = -1 * torch.log(probs1).gather(1, target.unsqueeze(1))
loss_manual = loss_manual.sum()

We should get the same results:

print(loss, ce, loss_manual)
> tensor(2.8824) tensor(2.8824) tensor(2.8824)

which looks correct.

1 Like

Hey, thank you so much for all explanations here @KFrank and @ptrblck !!!

I was performing some tests using tensors with lower dimensions to ensure that the loss result is correct and, due to this, expand to tensors with higher dimensions and do not worry about the possibility that the loss value is wrong.

I was actually using nn.CrossEntropyLoss () in the wrong way, I apologize for that.
Now I understood how to use it !!

Thank you for your time!

Best regards,

Matheus Santos.