LogSoftmax vs Softmax

Could you explain a bit what you mean by performance?

I’d assume that nn.LogSoftmax would give the same performance as nn.Softmax given that it is simply log on top, but it seems to provide much better results.
Is there any explanation to this?

I would say that this could be due to numerical stability reasons. It is related, but not identical, to the negative log likelihood, where the multiplication of probabilities becomes a summation of their logs. In both cases the log helps prevent numerical over-/underflow.

The conversion for the softmax is basically

softmax(x_i) = e^{x_i} / sum_k e^{x_k}

log_softmax(x_i) = log(e^{x_i}) - log(sum_k e^{x_k}) = x_i - log(sum_k e^{x_k})

So, you can see that this could be numerically more stable since you don’t have the division there.
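To make that concrete (a minimal sketch of my own, not from the original posts): a hand-written softmax overflows for large logits, while the log-domain form stays finite. Note that PyTorch's built-in F.softmax is itself implemented in a numerically stable way, so the naive version is spelled out by hand here:

import torch
import torch.nn.functional as F

logits = torch.tensor([1000.0, 1001.0, 1002.0])

# naive softmax: exp(1000.) overflows to inf in float32, so the division gives nan
naive = torch.exp(logits) / torch.exp(logits).sum()
print(naive)                         # tensor([nan, nan, nan])

# the log-domain form x_i - log(sum_k e^{x_k}) stays finite
print(F.log_softmax(logits, dim=0))  # tensor([-2.4076, -1.4076, -0.4076])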


I meant the loss function decreasing vs stagnating. Apologies, I should have been more clear.

Andrew, thanks so much. This makes perfect sense. Laying Zipf’s law on top of this adds to the explanation.


No worries, I am just wondering how you compared both non-linearities.
Did you implement your criterion manually or did you use a loss function from torch.nn?

I used the loss function - CrossEntropy.

You should pass raw logits to nn.CrossEntropyLoss, since the function itself applies F.log_softmax and nn.NLLLoss() on the input.
If you pass log probabilities (from nn.LogSoftmax) or probabilities (from nn.Softmax()) your loss function won’t work as intended.
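As a quick sanity check (my own sketch, assuming a standard classification setup with class-index targets), F.cross_entropy on raw logits matches F.nll_loss applied to F.log_softmax of the same logits:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # raw, unnormalized model outputs
targets = torch.randint(0, 10, (4,))   # class indices

loss_ce = F.cross_entropy(logits, targets)
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss_ce, loss_nll))  # True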


The Softmax vs LogSoftmax that I am talking about, though, is not in the loss function - it's the last layer in the net, after the LSTM.

Sure, but somehow you are comparing the performance of both non-linearities.
Are you training a model with these non-linearities and feeding their outputs to nn.CrossEntropyLoss()?


I am not sure I understand your question, but it’s ok. I believe I have the answer for my original question based on the comments above. Thank you.

The Softmax vs LogSoftmax that I am talking about, though, is not in the loss function - it's the last layer in the net, after the LSTM.

Say you have the generic setup

    def forward(self, x):
        out = self.linear_1(x)
        ...
        out = F.relu(out)
        logits = self.linear_out(out)       # raw, unnormalized scores
        probas = ACTIVATION(logits, dim=1)  # e.g. F.softmax or F.log_softmax
        return logits, probas

and then your training:

for epoch ...:
    for minibatch ...:

        logits, probas = model(features)
        cost = COST_FN(logits, targets)
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()

where ACTIVATION would be, e.g., F.softmax or F.log_softmax (and F is torch.nn.functional).

I am not sure I understand your question, but it’s ok. I believe I have the answer for my original question based on the comments above. Thank you.

So, if I understand correctly, @ptrblck, the question was whether you are passing logits or probas to COST_FN. Mathematically/conceptually, it might seem natural to pass probas to COST_FN (where COST_FN is e.g. nn.CrossEntropyLoss/F.cross_entropy), but cross entropy applies log_softmax itself. So, feeding softmax or log_softmax values to it, although it sounds correct based on how these functions are named, would cause weird results, because these functions then essentially get applied twice.

It may sound super trivial and is mentioned in the documentation, but I am just pointing it out because I made that mistake some time ago when I started using PyTorch and spent quite some time with gradient checking, figuring out what was going on.
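A minimal sketch of that mistake (my own example, not from the original posts): feeding softmax probabilities to F.cross_entropy means the normalization is effectively applied twice, and the loss no longer matches the one computed on the raw logits:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))

loss_from_logits = F.cross_entropy(logits, targets)                    # intended usage
loss_from_probas = F.cross_entropy(F.softmax(logits, dim=1), targets)  # normalization applied twice
print(loss_from_logits, loss_from_probas)                              # the two losses differ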


That’s what I think might have gone wrong.
While there are good points in this thread, I’m worried about the validity of the overall method to measure the “performance” of these non-linearities. Thanks for clarifying this issue.


I’ve discovered the mystery of the softmax here.
Accidentally I had two log_softmax applications - one in my model and one inside my loss function (in cross entropy). Since log_softmax of log_softmax gives the same result, the model was actually still behaving correctly; but when I switched to plain softmax, it messed up the numbers.
Thanks again to all for the explanations.
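That observation checks out (a quick sketch of my own, not from the original posts): log_softmax already returns normalized log-probabilities, so applying it a second time changes nothing, whereas softmax of a softmax does change the values:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 10)

log_once = F.log_softmax(x, dim=1)
log_twice = F.log_softmax(log_once, dim=1)
print(torch.allclose(log_once, log_twice))  # True: log_softmax is idempotent

p_once = F.softmax(x, dim=1)
p_twice = F.softmax(p_once, dim=1)
print(torch.allclose(p_once, p_twice))      # False: softmax of softmax is different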

How I understand the difference between log_softmax and softmax is that applying log turns expensive operations into simpler ones, e.g. log(a/b) = log(a) - log(b) and so on.
Since both are monotonic functions, applying log makes the computation of softmax easier, and applying exp to the output gives you back the actual class probabilities (the softmax values themselves).

Correct me if I am wrong.

Also, you’ll be numerically more stable as the log-sum-exp trick is applied.
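For reference, a sketch (my own, not from the original posts) of the log-sum-exp trick: subtracting the maximum before exponentiating keeps the largest exponent at exp(0) = 1, which is what makes log_softmax stable for large inputs:

import torch
import torch.nn.functional as F

def log_softmax_manual(x, dim=-1):
    # shift by the max so the largest value passed to exp is 0
    x_max = x.max(dim=dim, keepdim=True).values
    return x - x_max - torch.log(torch.exp(x - x_max).sum(dim=dim, keepdim=True))

x = torch.tensor([1000.0, 1001.0, 1002.0])
print(log_softmax_manual(x))    # finite: tensor([-2.4076, -1.4076, -0.4076])
print(F.log_softmax(x, dim=0))  # matches the built-in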


Looks like for super small input, the log_softmax would fail anyway.
For example

In [127]: z = th.DoubleTensor(np.array([1e-15, 2e-15, 3e-15]))

In [128]: F.log_softmax(z, dim=0)
Out[128]: tensor([-1.0986, -1.0986, -1.0986], dtype=torch.float64)

It returns the same value for all 3 inputs.

I think this is a known limitation. While the log-sum-exp trick will save you from overflows (since the largest number would be zero), it won’t save you from underflow, if you are dealing with numbers close to zero.
Your input values, in their original form as well as after subtracting the max value, are still in the range ~1e-15 and would thus underflow to zero if you apply torch.exp on them:

z = torch.tensor([1e-15, 2e-15, 3e-15], dtype=torch.float64)
z_max = torch.max(z)

print(torch.exp(z))
> tensor([1.0000, 1.0000, 1.0000], dtype=torch.float64)
print(torch.exp(z - z_max))
> tensor([1.0000, 1.0000, 1.0000], dtype=torch.float64)

Are you sure there’s an actual underflow problem there and not simply that the printout only shows four decimals?

Try printing torch.exp(z) - 1 instead: tensor([1.1102e-15, 1.9984e-15, 3.1086e-15], dtype=torch.float64)

That looks right to me (with rounding errors due to floating point etc., of course), since e^x ≈ 1 + x as x → 0, so exp(z) - 1 ≈ z, and for differences this small the effect on the log_softmax output shouldn't be visible.

Unless I’m completely misunderstanding something, in which case I’m looking forward to learning something new when you respond :)

Hi Daniel!

I think you are correct. I would call this round-off error (where,
numerically, (1.0 + delta) - 1.0 becomes exactly floating-point
zero somewhere around delta = 1.e-16 (for double precision)).

To me, underflow is where a very small epsilon becomes exactly
floating-point zero somewhere around epsilon = 1.e-324 (for
double precision).

The problem is that for small delta, exp (delta) ~ 1.0 + delta,
so you get exactly this kind of round-off error.

Note that many math libraries, including pytorch, implement the
expm1() function to address this issue.

(I don’t think this helps with Softmax or LogSoftmax though, because
in this case you anyway end up with results of order 1.)
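(For reference, newer PyTorch versions provide torch.expm1 directly; a quick sketch of my own, separate from the 0.3.0 script below:)

import torch

z = torch.tensor([1e-15, 2e-15, 3e-15], dtype=torch.float64)
print(torch.expm1(z))      # keeps the tiny values ~1e-15, 2e-15, 3e-15
print(torch.exp(z) - 1.0)  # shows the round-off error discussed above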

This (0.3.0) script illustrates the round-off error issue and the expm1()
function:

import torch
torch.__version__

import math

def expm1 (t):   # not yet implemented in 0.3.0
    res  = torch.zeros_like (t)
    for  i in range (t.shape[0]):
        res[i] = math.expm1 (t[i])   # double precision, then truncated, if FloatTensor
    return res

z = torch.DoubleTensor ([1.e-15, 2.e-15, 3.e-15])

z_max = torch.max (z)

torch.set_printoptions (precision = 20)

expm1 (z)                             # correct to about 15 decimal digits
expm1 (z - z_max)                     # correct to about 15 decimal digits

expm1 (z.float())                     # not exactly single precision
expm1 (z.float() - z_max)             # not exactly single precision

torch.exp (z) - 1.0                   # double precision (without expm1)
torch.exp (z - z_max) - 1.0           # double precision (without expm1)

torch.exp (z.float()) - 1.0           # single precision (without expm1)
torch.exp (z.float() - z_max) - 1.0   # single precision (without expm1)

Here is the output:

>>> import torch
>>> torch.__version__
'0.3.0b0+591e73e'
>>>
>>> import math
>>>
>>> def expm1 (t):   # not yet implemented in 0.3.0
...     res  = torch.zeros_like (t)
...     for  i in range (t.shape[0]):
...         res[i] = math.expm1 (t[i])   # double precision, then truncated, if FloatTensor
...     return res
...
>>> z = torch.DoubleTensor ([1.e-15, 2.e-15, 3.e-15])
>>>
>>> z_max = torch.max (z)
>>>
>>> torch.set_printoptions (precision = 20)
>>>
>>> expm1 (z)                             # correct to about 15 decimal digits

1.00000e-15 *
 1.00000000000000066613
 2.00000000000000177636
 3.00000000000000444089
[torch.DoubleTensor of size 3]

>>> expm1 (z - z_max)                     # correct to about 15 decimal digits

1.00000e-15 *
 -1.99999999999999755751
 -0.99999999999999900080
 0.00000000000000000000
[torch.DoubleTensor of size 3]

>>>
>>> expm1 (z.float())                     # not exactly single precision

1.00000e-15 *
 1.00000000362749363880
 2.00000000725498727761
 2.99999990500336233268
[torch.FloatTensor of size 3]

>>> expm1 (z.float() - z_max)             # not exactly single precision

1.00000e-15 *
 -1.99999979549675055424
 -0.99999989774837527712
 0.00000000000000000000
[torch.FloatTensor of size 3]

>>>
>>> torch.exp (z) - 1.0                   # double precision (without expm1)

1.00000e-15 *
 1.11022302462515654042
 1.99840144432528155072
 3.10862446895043786910
[torch.DoubleTensor of size 3]

>>> torch.exp (z - z_max) - 1.0           # double precision (without expm1)

1.00000e-15 *
 -1.99840144432528155072
 -0.99920072216264077536
 0.00000000000000000000
[torch.DoubleTensor of size 3]

Best.

K. Frank

thanks, very helpful!