LogSoftmax & NLLLoss - Why is loss not zero?

When using the LogSoftmax & NLLLoss pair, why doesn’t a “one hot” input of the correct category produce a loss of zero? I suspect I’m missing something.

Variation of the example from the docs for NLLLoss:

import torch
import torch.nn as nn

m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()
# input is of size N x C = 1 x 3
# Input is a perfectly matching one-hot for category 0
input = torch.tensor([[1, 0, 0]], dtype=torch.float)
# We want category 0, so we should be right on target
target = torch.tensor([0])
output = loss(m(input), target)
output

Result: tensor(0.5514)
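
Checking where the 0.5514 comes from (reusing m, input, and target from above), it’s just the negative of the log-softmax value for the target class:

log_probs = m(input)
print(log_probs)                     # tensor([[-0.5514, -1.5514, -1.5514]])
print(-log_probs[0, target.item()])  # tensor(0.5514)

So the loss would only be zero if that log value were exactly 0.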

In this post, @ptrblck noted:

nn.NLLLoss expects the inputs to be log probabilities

Let’s use his trick to undo the log:

m(input).exp()

Result: tensor([[0.5761, 0.2119, 0.2119]])

The above is exactly what we’d get by applying Softmax (without the Log) directly, which is good, but these clearly aren’t the probabilities that would give us a zero loss.
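
A quick way to verify that equivalence (assuming import torch.nn.functional as F, which isn’t in the snippet above):

import torch.nn.functional as F

print(F.softmax(input, dim=1))                                  # tensor([[0.5761, 0.2119, 0.2119]])
print(torch.allclose(m(input).exp(), F.softmax(input, dim=1)))  # True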

Let’s try log probabilities directly:

lp = torch.tensor([[1.0, 0, 0]]).log()
print(lp)
loss(lp, target)

Result is what we’d expect: a loss of zero:

tensor([[0., -inf, -inf]])
tensor(0.)

The above is a simplified version of the MNIST example.

The effect of this behavior is that we get a nonzero loss even for correct predictions, which seems to cause the weights to grow slowly without bound.

What am I doing wrong? Thanks!

As you’ve already explained, the input contains logits, not probabilities.
The probabilities are shown in:

m(input).exp()
> tensor([[0.5761, 0.2119, 0.2119]])

which doesn’t give a 1 for the target class and zeros otherwise.

If you want to drive the loss towards zero, you would have to pass a large positive logit for the target class (input[0, 0]), as seen here:


import torch.nn.functional as F

for factor in [1, 5, 10, 100]:
    input = torch.zeros(1, 3)
    input[0, 0] = factor * 1.  # scale up the logit of the target class
    output = loss(m(input), target)
    print(F.softmax(input, dim=1))
    print(output)

> tensor([[0.5761169195, 0.2119415700, 0.2119415700]])
tensor(0.5514446497)
tensor([[0.9867032766, 0.0066483542, 0.0066483542]])
tensor(0.0133859031)
tensor([[9.9990916252e-01, 4.5395805500e-05, 4.5395805500e-05]])
tensor(9.0833353170e-05)
tensor([[1.0000000000e+00, 3.7835058537e-44, 3.7835058537e-44]])
tensor(0.)

@ptrblck, thanks for the many answers you generously provide in these forums, especially for us newbies.

I suspect I’m confusing the range of unit outputs (0-1) with the “raw” linear-layer dot products. Somewhere I read that biological neurons seldom saturate, so I kind of expected the “raw” value to also be around 1 for a “match” (which it might be for old-school squared-error loss networks).

Your explanation suggests that LogSoftmax and NLLLoss networks work differently: they evidently drive the output linear layer to produce large logit-like dot products. This would seem to require that output units learn weights which are large relative to the number of inputs. (E.g. 5 weights of 20 each, when presented with inputs of ~1, would produce a logit-like dot product of 100.) Is this roughly correct?
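
In code, the back-of-the-envelope arithmetic I have in mind (purely illustrative numbers):

w = torch.full((5,), 20.0)  # 5 weights of 20 each
x = torch.ones(5)           # inputs of ~1
print(torch.dot(w, x))      # tensor(100.) -- a logit-like dot product of 100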

Have folks found that having extreme-value logit-like linear layers works better than more moderate (-1, 1) values? (Isn’t the one just a linear rescaling of the other?) Anyone know of literature on this?

And, if I did want to play with an output layer that learns weights that produce ~(-1, 1) “raw” values (before activation), which PyTorch activation & cost functions are preferred? Back to sigmoid & squared-error or is there something more modern? Softmax is nice because it produces probability-like values. Is there something that will produce ratios from ~(-1, 1) dot products? (I couldn’t find a “HardMax” module…)
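
For reference, here is a minimal sketch of the kind of “old-school” pairing I mean, assuming nn.Sigmoid and nn.MSELoss with a one-hot target (purely illustrative):

act = nn.Sigmoid()
mse = nn.MSELoss()
raw = torch.tensor([[1.0, -1.0, -1.0]])    # "moderate" raw values from a linear layer
one_hot = torch.tensor([[1.0, 0.0, 0.0]])
print(mse(act(raw), one_hot))              # small but nonzero; no extreme logits required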

I don’t think it works differently; the softmax is just removed from the model for better numerical stability.
If you take a look at the applied loss function in nn.CrossEntropyLoss, you see that the softmax is calculated first and the log is applied on its result.
While this works theoretically, you could run into numerical issues due to the limited floating point precision.
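
As a small illustration: with large logits the separately computed softmax underflows to zero for the small entries, so its log becomes -inf, while the fused F.log_softmax stays finite:

x = torch.tensor([[1000.0, 0.0, 0.0]])
print(torch.log(F.softmax(x, dim=1)))  # tensor([[0., -inf, -inf]]) -- underflow, then log(0)
print(F.log_softmax(x, dim=1))         # tensor([[0., -1000., -1000.]]) -- numerically stable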
Instead of directly using the mentioned function and adding a softmax layer to the model (which would then output “probabilities” in the range [0, 1]), you pass the raw activations of the last linear layer, and F.log_softmax is applied internally by nn.CrossEntropyLoss.
If you want to calculate the probabilities, you could of course apply F.softmax to the output and e.g. print it for debugging purposes. Just don’t pass these F.softmax outputs to nn.CrossEntropyLoss, since F.log_softmax will be applied to them internally on top of the softmax.
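
In code, a minimal sketch (reusing target from above):

criterion = nn.CrossEntropyLoss()
logits = torch.tensor([[1.0, 0.0, 0.0]])             # raw outputs of the last linear layer
print(criterion(logits, target))                     # tensor(0.5514) -- matches NLLLoss(LogSoftmax(logits))
# wrong: passing softmax outputs means the values get normalized twice
print(criterion(F.softmax(logits, dim=1), target))   # ~tensor(0.8711) -- not the intended loss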