Logits vs. log-softmax

Zhihan_Yang · September 11, 2020, 10:47pm

In a classification task where the input can only belong to one class, the softmax function is naturally used as the final activation function, taking in “logits” (often from a preceding linear layer) and outputting proper probabilities.

I am confused about the exact meaning of “logits” because many call them “unnormalized log-probabilities”. Yet they differ from the result of applying log directly to the output of softmax, which consists of actual probabilities.

For example, consider the following experiment:

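import torch
import torch.nn.functional as F

# Raw scores for four classes, e.g. the output of a final linear layer
logits = torch.tensor([51., 50., 49., 48.])

# Softmax turns the logits into probabilities that sum to 1
print('Probas from logits:\n', F.softmax(logits, dim=0))

# Log-softmax gives the log of those probabilities
print('Log-softmax:\n', F.log_softmax(logits, dim=0))
print('Difference between logits and log-softmax:\n', logits - F.log_softmax(logits, dim=0))

# Applying softmax to the log-softmax recovers the same probabilities
print('Probas from log-softmax:\n', F.softmax(F.log_softmax(logits, dim=0), dim=0))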

and its output:

Probas from logits:
 tensor([0.6439, 0.2369, 0.0871, 0.0321])
Log-softmax:
 tensor([-0.4402, -1.4402, -2.4402, -3.4402])
Difference between logits and log-softmax:
 tensor([51.4402, 51.4402, 51.4402, 51.4402])
Probas from log-softmax:
 tensor([0.6439, 0.2369, 0.0871, 0.0321])

We can see that 1) the difference between the logits and the result of log-softmax is a constant and 2) the logits and the result of log-softmax yield the same probabilities after applying softmax.

So, my question is, why do we have a designated function for log-softmax? Why would we ever need the log-softmax of logits?

KFrank · September 12, 2020, 2:23am

Hi Zhihan!

The short, practical answer is because of what you typically do with
the log-softmax of the logits: you pass it into a loss function such
as nll_loss(). (Doing this gives you, in effect, the cross-entropy loss.)

If you were to pass the raw logits into nll_loss() you would get an
ill-behaved loss function that is unbounded below. For example, by
making the biases of your last linear layer (the one that produces the
logits) arbitrarily large, you make the logits arbitrarily large, and the
loss function becomes arbitrarily “good,” that is, large and negative.
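
To make this concrete, here is a minimal sketch. The single-sample
batch, the target class, and the shift values are illustrative
assumptions of mine; adding a constant to every logit stands in for
making the last layer's biases larger.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[51., 50., 49., 48.]])  # one sample, four classes
target = torch.tensor([0])                      # index of the true class

for shift in (0., 100., 10000.):
    shifted = logits + shift  # mimics ever-larger biases in the last layer

    # Raw logits into nll_loss(): the loss is just minus the target logit
    raw_loss = F.nll_loss(shifted, target)

    # log_softmax() first: the shift is normalized away
    proper_loss = F.nll_loss(F.log_softmax(shifted, dim=1), target)

    # cross_entropy() takes raw logits and should agree with proper_loss
    ce_loss = F.cross_entropy(shifted, target)

    print(shift, raw_loss.item(), proper_loss.item(), ce_loss.item())
```

The raw loss comes out as -51, -151, and -10051, so it can be pushed
as low as you like, while the loss computed on the log-softmax stays
at about 0.4402 whatever the shift.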

But why is this?

As you have noticed, the log() function is almost, but not quite, the
inverse of the softmax() function – the difference being a constant
(the same across classes for a given set of logits).

This constant is the difference between proper log-probabilities and
the “unnormalized log-probabilities” we call logits, and this is the
constant that becomes arbitrarily large when the nll_loss() function
diverges to -inf. Calculating log_softmax (logits) normalizes this
constant away. (And, in some sense, that’s all it does, because
log_softmax (log_softmax (logits)) = log_softmax (logits).)

This constant is the log of the denominator in the formula for
softmax(), namely log (sum_i {exp (logit_i)}).
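
As a quick check of the last few statements, the sketch below reuses
the logits from your experiment. It relies on torch.logsumexp() to
compute the constant; the variable names are just illustrative.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([51., 50., 49., 48.])

# The normalization constant: log of the softmax denominator
log_norm = torch.logsumexp(logits, dim=0)
print(log_norm)  # about 51.4402, the constant difference seen in the question

log_probs = F.log_softmax(logits, dim=0)

# log_softmax() is the logits with that constant subtracted
print(torch.allclose(log_probs, logits - log_norm, atol=1e-4))

# Proper log-probabilities already have a normalization constant of
# (numerically almost) zero, so a second log_softmax() changes nothing
print(torch.logsumexp(log_probs, dim=0))
print(torch.allclose(F.log_softmax(log_probs, dim=0), log_probs, atol=1e-5))
```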

log_softmax() has a further technical advantage: Calculating
log() of a sum of exp() terms for the normalization constant can
become numerically unstable, because exp() overflows for large
logits. PyTorch’s log_softmax() uses the “log-sum-exp trick” to
avoid this numerical instability.
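
Here is a minimal pure-Python sketch of the trick. This is the
textbook version, not PyTorch's actual implementation, and the
function names naive_logsumexp() and stable_logsumexp() are made up
for illustration.

```python
import math

def naive_logsumexp(xs):
    # Direct translation of log(sum_i exp(x_i)); exp() overflows for large x_i
    return math.log(sum(math.exp(x) for x in xs))

def stable_logsumexp(xs):
    # Factor out the largest value so that every exponent is <= 0
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

small = [51., 50., 49., 48.]
large = [1000., 999., 998.]

# For moderate logits the two versions agree (about 51.4402)
print(naive_logsumexp(small), stable_logsumexp(small))

# For large logits the naive version overflows inside exp()
try:
    naive_logsumexp(large)
except OverflowError as err:
    print('naive version failed:', err)

# The stable version still returns a finite result
print(stable_logsumexp(large))
```

Subtracting the maximum works because
log (sum_i {exp (x_i)}) = m + log (sum_i {exp (x_i - m)}) for any m,
and choosing m = max_i x_i makes the largest term exactly exp (0) = 1.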

From this perspective, the purpose of PyTorch’s log_softmax()
function is to remove this normalization constant – in a numerically
stable way – from the raw, unnormalized logits we get from a linear
layer so we can pass them into a useful loss function.

Best.

K. Frank