In a classification task where the input can belong to exactly one class, the softmax function is the natural choice of final activation: it takes in “logits” (often from a preceding linear layer) and outputs proper probabilities.
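For concreteness, here is a minimal sketch of the kind of classification head I mean; the layer sizes and the random inputs are made up purely for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

# hypothetical classifier head: 10 input features -> 4 classes (sizes chosen arbitrarily)
head = nn.Linear(10, 4)

x = torch.randn(2, 10)              # a small batch of made-up inputs
logits = head(x)                    # raw, unnormalized scores ("logits")
probas = F.softmax(logits, dim=1)   # proper probabilities along the class dimension
print(probas.sum(dim=1))            # each row sums to 1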
I am confused about the exact meaning of “logits”, because many people call them “unnormalized log-probabilities”. Yet the logits are different from the result of applying log directly to the output of softmax, which consists of actual, normalized probabilities.
For example, consider the following experiment:
import torch
import torch.nn.functional as F
logits = torch.tensor([51., 50., 49., 48.])
print('Probas from logits:\n', F.softmax(logits, dim=0))
print('Log-softmax:\n', F.log_softmax(logits, dim=0))
print('Difference between logits and log-softmax:\n', logits - F.log_softmax(logits, dim=0))
print('Probas from log-softmax:\n', F.softmax(F.log_softmax(logits, dim=0), dim=0))
and its output:
Probas from logits:
tensor([0.6439, 0.2369, 0.0871, 0.0321])
Log-softmax:
tensor([-0.4402, -1.4402, -2.4402, -3.4402])
Difference between logits and log-softmax:
tensor([51.4402, 51.4402, 51.4402, 51.4402])
Probas from log-softmax:
tensor([0.6439, 0.2369, 0.0871, 0.0321])
We can see that (1) the difference between the logits and the log-softmax output is the same constant across all entries, and (2) the logits and the log-softmax output yield the same probabilities after applying softmax.
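As far as I can tell, that constant is just the log of the softmax denominator, i.e. the log-sum-exp of the logits; continuing the snippet above:

print('Log-sum-exp of logits:\n', torch.logsumexp(logits, dim=0))
# prints tensor(51.4402), the same constant as in the difference above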
So, my question is: why do we have a dedicated function for log-softmax? Why would we ever need the log-softmax of the logits?