Hi there,
I’d assume that nn.LogSoftmax would give the same performance as nn.Softmax given that it is simply log on top, but it seems to provide much better results.
Is there any explanation to this?
I’m not sure if there is a definitive answer to why this works better, but to provide some insight: it’s worth noting that using the log-likelihood is very common in statistics. Here are some references on the use of the log-likelihood [1], [2], [3].
One key point to notice is that, depending on your loss function, this fundamentally changes the calculation. Let’s consider a case where your true class is 1 and your model estimates the probability of the true class to be 0.9. If your loss function is the L1 loss, its value is 0.1. If, on the other hand, you are using the negative log-likelihood, the value of the loss is 0.105 (assuming natural log).
Now suppose instead that your estimated probability is 0.3: the L1 loss is 0.7, while the negative log-likelihood is 1.20.
Comparing these two cases: with the plain likelihood-based loss (akin to softmax), the error increases by a factor of 7 (0.7/0.1) between the two examples; with the log-likelihood (akin to log-softmax), it increases by a factor of ~11 (1.20/0.105). The point is that even though log_softmax and softmax are monotonic, their effect on the relative values of the loss function changes. Using log_softmax punishes bigger mistakes in likelihood space more heavily.
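To make the arithmetic above concrete, here is a tiny sketch in plain Python that reproduces those numbers (the 0.1 vs 0.105 and 0.7 vs 1.20 loss values, and the two ratios):

import math

# L1-style loss vs. negative log-likelihood for the two example predictions
for p in (0.9, 0.3):
    l1 = 1.0 - p        # distance of the predicted probability from the true class
    nll = -math.log(p)  # negative log-likelihood (natural log)
    print(f"p={p}: L1 loss={l1:.3f}, NLL={nll:.3f}")

# ratio between the "bad" (p=0.3) and the "good" (p=0.9) prediction
print((1.0 - 0.3) / (1.0 - 0.9))      # ~7.0
print(math.log(0.3) / math.log(0.9))  # ~11.4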
Could you explain a bit what you mean by performance?
I’d assume that nn.LogSoftmax would give the same performance as nn.Softmax given that it is simply log on top, but it seems to provide much better results.
Is there any explanation to this?
I would say that this could be due to numerical stability reasons. This is related, but not identical, to the negative log-likelihood, where the multiplications become summations. In both cases you can prevent numerical over-/underflow.
The conversion for the softmax is basically
softmax(x_i) = exp(x_i) / sum_k exp(x_k)
log_softmax(x_i) = log(exp(x_i)) - log(sum_k exp(x_k)) = x_i - log(sum_k exp(x_k))
So, you can see that this could be numerically more stable since you don’t have the division there.
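A quick way to see this difference in practice (just a sketch; the printed values are what I would expect from float32 under-/overflow behavior, not copied from a run):

import torch
import torch.nn.functional as F

# Large-magnitude logits: torch.softmax itself is stabilized, but the small
# probabilities underflow to 0, so taking the log afterwards gives -inf.
x = torch.tensor([1000.0, 0.0, -1000.0])

print(torch.log(torch.softmax(x, dim=0)))  # expect tensor([0., -inf, -inf])
print(F.log_softmax(x, dim=0))             # expect tensor([0., -1000., -2000.])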
I meant the loss function decreasing vs stagnating. Apologies, I should have been more clear.
Andrew, thanks so much. This makes perfect sense. Laying Zipf’s law on top of this adds to the explanation.
No worries, I am just wondering how you compared both non-linearities.
Did you implement your criterion manually or did you use a loss function from nn?
I used the loss function - CrossEntropy.
You should pass raw logits to nn.CrossEntropyLoss, since the function itself applies F.log_softmax and nn.NLLLoss() on the input. If you pass log probabilities (from nn.LogSoftmax) or probabilities (from nn.Softmax()), your loss function won’t work as intended.
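A small sketch to check that relationship (the shapes, seed, and tensors are made up for illustration):

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)    # batch of 4 samples, 10 classes
targets = torch.randint(0, 10, (4,))

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(ce.item(), nll.item())   # the two values should match

# Passing probabilities instead of logits gives a different (unintended) loss
wrong = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), targets)
print(wrong.item())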
The Softmax vs LogSoftmax that I am talking about, though, is not in the loss function - it’s the last layer in the net, after the LSTM.
Sure, but somehow you are comparing the performance of both non-linearities.
Are you training a model with these non-linearities and feeding both to nn.CrossEntropyLoss()?
I am not sure I understand your question, but it’s ok. I believe I have the answer for my original question based on the comments above. Thank you.
The Softmax vs LogSoftmax that I am talking about, though, is not in the loss function - it’s the last layer in the net, after the LSTM.
Say you have the generic setup
def forward(self, x):
    out = self.linear_1(x)
    ...
    out = F.relu(out)
    logits = self.linear_out(out)
    probas = ACTIVATION(logits, dim=1)
    return logits, probas
and then your training:
for epoch ...:
    for minibatch ...:
        logits, probas = model(features)
        cost = COST_FN(logits, targets)
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()
where ACTIVATION would be e.g., F.softmax or F.log_softmax (where F is torch.nn.functional)
I am not sure I understand your question, but it’s ok. I believe I have the answer for my original question based on the comments above. Thank you.
So, if I understand correctly, @ptrblck, the question was whether you are passing logits or probas to COST_FN. Mathematically/conceptually it would make sense to pass probas to COST_FN (where COST_FN is e.g., CrossEntropy/F.cross_entropy), but CrossEntropy applies log_softmax itself. So feeding softmax or log_softmax values to CrossEntropy, although it sounds correct based on how these functions are named, would cause weird results, because these functions would then essentially be applied twice.
It may sound super trivial and is mentioned in the documentation, but I am just pointing it out because I made that mistake some time ago when I started using PyTorch and spent quite some time with gradient checking, figuring out what was going on.
That’s what I think might have gone wrong.
While there are good points in this thread, I’m worried about the validity of the overall method to measure the “performance” of these non-linearities. Thanks for clarifying this issue.
I’ve discovered the source of the softmax mystery here.
Accidentally I had two log-softmax operations - one of them inside my loss function (CrossEntropy). Since the log-softmax of a log-softmax gives the same result, the model was actually training correctly, but when I switched the last layer to plain softmax, it messed up the numbers.
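That matches what you would expect mathematically: log_softmax is idempotent, so applying it twice is harmless, while inserting a plain softmax before the loss changes the values. A quick sketch to check this (random, made-up logits):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 5)  # made-up logits

once = F.log_softmax(x, dim=1)
twice = F.log_softmax(once, dim=1)
print(torch.allclose(once, twice))  # True: log_softmax of log_softmax is unchanged

soft_then_log = F.log_softmax(F.softmax(x, dim=1), dim=1)
print(torch.allclose(once, soft_then_log))  # False: a softmax in between changes the result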
Thanks again to all for the explanations.
How I understand the difference between log_softmax and softmax is that,
when you apply log onto complex operations, they become simple, e.g. log(a/b) = log(a) - log(b), and so on.
As both of them are monotonic functions, applying log makes the computation of the softmax easier, and when you apply exp to the output again you get back your real class values (the softmax values themselves).
Correct me if I am wrong.
Also, it’s numerically more stable, since the log-sum-exp trick is applied.
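For reference, here is a minimal sketch of that log-sum-exp trick (the example values are made up to force an overflow in the naive version):

import torch

def stable_logsumexp(x):
    # shift by the max so the largest exponent is 0 and exp cannot overflow
    x_max = x.max()
    return x_max + torch.log(torch.exp(x - x_max).sum())

x = torch.tensor([1000.0, 999.0, 998.0])
print(torch.log(torch.exp(x).sum()))  # naive version overflows to inf
print(stable_logsumexp(x))            # finite, ~1000.41
print(torch.logsumexp(x, dim=0))      # PyTorch's built-in gives the same result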
Looks like for super small inputs, log_softmax would fail anyway. For example:
In [127]: z = th.DoubleTensor(np.array([1e-15, 2e-15, 3e-15]))
In [128]: F.log_softmax(z, dim=0)
Out[128]: tensor([-1.0986, -1.0986, -1.0986], dtype=torch.float64)
It returns the same value for all 3 input values.
I think this is a known limitation. While the log-sum-exp trick will save you from overflows (since the largest number would be zero), it won’t save you from underflow, if you are dealing with numbers close to zero.
Your input values, both in their original form and after subtracting the max value, are still in the range of ~1e-15 and would thus underflow if you apply torch.exp on them:
z = torch.tensor([1e-15, 2e-15, 3e-15], dtype=torch.float64)
z_max = torch.max(z)
print(torch.exp(z))
> tensor([1.0000, 1.0000, 1.0000], dtype=torch.float64)
print(torch.exp(z - z_max))
> tensor([1.0000, 1.0000, 1.0000], dtype=torch.float64)
Are you sure there’s an actual underflow problem there and not simply that the printout only shows four decimals?
Try printing torch.exp(z) - 1 instead: tensor([1.1102e-15, 1.9984e-15, 3.1086e-15], dtype=torch.float64)
That looks right to me (with rounding errors due to floating point etc., of course), since e^x ≈ 1 + x for x close to 0, and the differences in log_softmax for such small inputs simply shouldn’t be visible at four printed decimals.
Unless I’m completely misunderstanding something, in which case I’m looking forward to learning something new when you respond.
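One way to check this (a sketch; I would expect the three log_softmax outputs to differ only in the last few digits, since the differences sit close to double-precision machine epsilon):

import torch
import torch.nn.functional as F

torch.set_printoptions(precision=20)

z = torch.tensor([1e-15, 2e-15, 3e-15], dtype=torch.float64)
print(torch.exp(z) - 1)         # the tiny differences survive in float64
print(F.log_softmax(z, dim=0))  # the outputs should no longer look identical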