LogSoftmax vs Softmax

Hi there,
I’d assume that nn.LogSoftmax would give the same performance as nn.Softmax given that it is simply log on top, but it seems to provide much better results.
Is there any explanation to this?

1 Like

I’m not sure there is a definitive answer as to why this works better, but to provide some insight, it’s worth noting that using the log-likelihood is very common in statistics. Here are some references on the use of the log-likelihood [1], [2], [3].

One key point to notice is that, depending on your loss function, this fundamentally changes the calculation. Let’s consider a case where your true class is 1 and your model estimates the probability of the true class as 0.9. If your loss function is the L1 loss, the value of the loss is 1 − 0.9 = 0.1. On the other hand, if you are using the negative log-likelihood, the value of the loss is −ln(0.9) ≈ 0.105 (assuming natural log).

On the other hand, if your estimated probability is 0.3 and you are using the likelihood-based loss, the value of your loss function is 1 − 0.3 = 0.7. If you are using the negative log-likelihood, the value of your loss function is −ln(0.3) ≈ 1.20.

Now if we consider these two cases, using the standard likelihood-based loss (akin to softmax), the error increases by a factor of 7 (0.7 / 0.1) between those two examples. Using the negative log-likelihood (akin to log-softmax), the error increases by a factor of ~11 (1.20 / 0.105). The point is, even though log-softmax and softmax are related by a monotonic transformation, their effect on the relative values of the loss function changes. Using the log-softmax will punish bigger mistakes in likelihood space more heavily.
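As a quick sanity check of the arithmetic above, here is a small sketch in plain Python, using the example probabilities 0.9 and 0.3:

import math

# Compare a plain likelihood-style loss (1 - p) with the negative
# log-likelihood (-log p) for two estimates of the true class probability.
for p in (0.9, 0.3):
    print(f"p = {p}: likelihood loss = {1 - p:.3f}, "
          f"negative log-likelihood = {-math.log(p):.3f}")

# Ratio between the "bad" and the "good" prediction in each space:
print(f"{(1 - 0.3) / (1 - 0.9):.1f}")                  # 7.0
print(f"{(-math.log(0.3)) / (-math.log(0.9)):.1f}")    # 11.4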

11 Likes

Could you explain a bit what you mean by performance?

I’d assume that nn.LogSoftmax would give the same performance as nn.Softmax given that it is simply log on top, but it seems to provide much better results.
Is there any explanation to this?

I would say that this could be due to numerical stability reasons. This is related, though not identical, to the negative log likelihood, where the product of probabilities becomes a summation of logs. In both cases you can prevent numerical over-/underflow.

The conversion for the softmax is basically

softmax(x_i) = e^{x_i} / sum_k e^{x_k}

log_softmax(x_i) = log(e^{x_i}) - log(sum_k e^{x_k}) = x_i - log(sum_k e^{x_k})

So, you can see that this could be numerically more stable since you don’t have the division there.
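To make the stability point concrete, here is a small sketch using torch.nn.functional; the logit values are just chosen to force float32 underflow:

import torch
import torch.nn.functional as F

# With very spread-out logits, softmax underflows to exactly 0 for the small
# class, so taking the log afterwards yields -inf, while log_softmax stays finite.
logits = torch.tensor([[0.0, 200.0]])

print(torch.log(F.softmax(logits, dim=1)))   # tensor([[-inf, 0.]])
print(F.log_softmax(logits, dim=1))          # tensor([[-200., 0.]])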

3 Likes

I meant the loss function decreasing vs stagnating. Apologies, I should have been more clear.

Andrew, thanks so much. This makes perfect sense. Laying Zipf’s law on top of this adds to the explanation.

1 Like

No worries, I am just wondering how you compared both non-linearities.
Did you implement your criterion manually or did you use a loss function from torch.nn?

I used the loss function - CrossEntropy.

You should pass raw logits to nn.CrossEntropyLoss, since the function itself applies F.log_softmax and nn.NLLLoss() on the input.
If you pass log probabilities (from nn.LogSoftmax) or probabilities (from nn.Softmax()) your loss function won’t work as intended.
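A minimal sketch of the intended usage (the tensor shapes here are just an example):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

logits = torch.randn(4, 10)              # raw model outputs, no (log_)softmax applied
targets = torch.randint(0, 10, (4,))     # class indices

loss = criterion(logits, targets)        # correct: log_softmax + NLLLoss applied internally
# loss = criterion(torch.softmax(logits, dim=1), targets)   # wrong: softmax applied twice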

2 Likes

The Softmax vs LogSoftmax that I am talking about, though, is not in the loss function - it’s the last layer in the net, after the LSTM.

Sure, but somehow you are comparing the performance of both non-linearities.
Are you training a model with these non-linearities and feeding both outputs to nn.CrossEntropyLoss()?

I am not sure I understand your question, but it’s ok. I believe I have the answer for my original question based on the comments above. Thank you.

The Softmax vs LogSoftmax that I am talking about, though, is not in the loss function - it’s the last layer in the net, after the LSTM.

Say you have the generic setup

    def forward(self, x):
        out = self.linear_1(x)
        ...
        out = F.relu(out)
        logits = self.linear_out(out)
        probas = ACTIVATION(logits, dim=1)
        return logits, probas

and then your training:

for epoch ...:
    for minibatch ...:

        logits, probas = model(features)
        cost = COST_FN(logits, targets)
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()

where ACTIVATION would be e.g., F.softmax or F.log_softmax (where F is torch.nn.functional)

I am not sure I understand your question, but it’s ok. I believe I have the answer for my original question based on the comments above. Thank you.

So, if I understand correctly, @ptrblck the question was whether logits or probas are passed to COST_FN. Mathematically/conceptually, it would make sense for probas to be passed to COST_FN (where COST_FN is e.g., CrossEntropy/F.cross_entropy), but CrossEntropy applies log_softmax itself. So, feeding softmax or log_softmax values to CrossEntropy, although it sounds correct based on how these functions are named, would cause weird results because these functions would essentially be applied twice.

It may sound super trivial and is mentioned in the documentation, but I am just pointing it out because I made that mistake some time ago when I started using PyTorch and spent quite some time with gradient checking, figuring out what was going on.
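To make that concrete, here is a rough check with random tensors (just for illustration), showing that the loss value changes when softmax output is fed to F.cross_entropy instead of raw logits:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))

# Intended usage: raw logits in, log_softmax + NLL applied internally.
print(F.cross_entropy(logits, targets))

# Softmax output in: softmax gets applied a second time inside the loss,
# which squashes the differences between the classes and gives a different loss.
print(F.cross_entropy(F.softmax(logits, dim=1), targets))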

4 Likes

That’s what I think might have gone wrong.
While there are good points in this thread, I’m worried about the validity of the overall method to measure the “performance” of these non-linearities. Thanks for clarifying this issue.

I’ve figured out the mystery of the softmax here.
Accidentally I had two log_softmax applications - one as the last layer in my network and one inside my loss function (cross entropy). Since log_softmax of log_softmax gives the same result, the model was actually training correctly; but when I switched to plain softmax, it messed up the numbers.
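For completeness, a quick check of that idempotence (a sketch with torch.nn.functional):

import torch
import torch.nn.functional as F

# log_softmax is idempotent: applying it on top of log_softmax output leaves
# the values unchanged, while applying it on top of softmax output does not.
x = torch.randn(2, 5)
ls = F.log_softmax(x, dim=1)

print(torch.allclose(F.log_softmax(ls, dim=1), ls))                   # True
print(torch.allclose(F.log_softmax(F.softmax(x, dim=1), dim=1), ls))  # False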
Thanks again to all for the explanations.