Could you explain a bit what you mean by performance?

I’d assume that nn.LogSoftmax would give the same performance as nn.Softmax given that it is simply log on top, but it seems to provide much better results.

Is there any explanation to this?

I would say that this could be due to numerical stability reasons. It is related (though not identical) to the negative log likelihood, where the product of probabilities becomes a sum of log probabilities. In both cases you can prevent numerical over-/underflow.
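To see why the product-to-sum conversion matters, here is a plain-Python sketch (with made-up probabilities, purely for illustration):

```python
import math

# Multiplying many small probabilities underflows double precision to 0.0...
probs = [0.001] * 200
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value 1e-600 is far below the smallest double

# ...while the equivalent sum of log-probabilities stays finite.
log_sum = sum(math.log(p) for p in probs)
print(log_sum)
```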

The conversion for the softmax is basically

softmax(x_i) = e^{x_i} / sum_k e^{x_k}

logsoftmax(x_i) = log(e^{x_i}) - log sum_k e^{x_k} = x_i - log sum_k e^{x_k}

So, you can see that this could be numerically more stable since you don’t have the division there.
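To make that concrete, here is a plain-Python sketch (not the PyTorch implementation) contrasting the naive formula with the max-shifted one that `log_softmax` effectively uses:

```python
import math

def naive_log_softmax(xs):
    # Direct translation of log(e^{x_i} / sum_k e^{x_k}):
    # math.exp(1000.0) raises OverflowError, so this fails for large logits.
    total = sum(math.exp(x) for x in xs)
    return [math.log(math.exp(x) / total) for x in xs]

def stable_log_softmax(xs):
    # log_softmax(x_i) = x_i - (m + log sum_k e^{x_k - m}) with m = max(xs):
    # the largest exponent becomes 0, so nothing overflows.
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

print(stable_log_softmax([1000.0, 1001.0, 1002.0]))  # finite log-probabilities
```
The naive version raises an `OverflowError` on the same input, while the shifted version returns valid log-probabilities.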

I meant the loss function decreasing vs stagnating. Apologies, I should have been more clear.

Andrew, thanks so much. This makes perfect sense. Laying Zipf’s law on top of this adds to the explanation.

No worries, I am just wondering how you compared both non-linearities.

Did you implement your criterion manually or did you use a loss function from `nn`?

I used the loss function - CrossEntropy.

You should pass raw logits to `nn.CrossEntropyLoss`, since the function itself applies `F.log_softmax` and `nn.NLLLoss()` on the input. If you pass log probabilities (from `nn.LogSoftmax`) or probabilities (from `nn.Softmax()`), your loss function won’t work as intended.

The `Softmax` vs `LogSoftmax` that I am talking about though is not in the loss function - it’s the last layer in the net, after the LSTM.

Sure, but somehow you are comparing the performance of both non-linearities.

Are you training a model with these non-linearities and feeding both to `nn.CrossEntropyLoss()`?

I am not sure I understand your question, but it’s ok. I believe I have the answer for my original question based on the comments above. Thank you.

> The Softmax vs LogSoftmax that I am talking about though is not in the loss function - it’s the last layer in the net, after the LSTM.

Say you have the generic setup

```
def forward(self, x):
    out = self.linear_1(x)
    ...
    out = F.relu(out)
    logits = self.linear_out(out)
    probas = ACTIVATION(logits, dim=1)
    return logits, probas
```

and then your training:

```
for epoch ...:
    for minibatch ...:
        logits, probas = model(features)
        cost = COST_FN(logits, targets)
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()
```

where `ACTIVATION` would be e.g. `F.softmax` or `F.log_softmax` (where `F` is `torch.nn.functional`).

> I am not sure I understand your question, but it’s ok. I believe I have the answer for my original question based on the comments above. Thank you.

So, if I understand correctly, @ptrblck, the question was whether you are passing `logits` or `probas` to COST_FN. Mathematically/conceptually, it would make sense to pass `probas` to COST_FN (where COST_FN is e.g. CrossEntropy/F.cross_entropy), but CrossEntropy applies log_softmax itself. So, feeding softmax or log_softmax values to CrossEntropy, although it sounds correct based on how these functions are named, would cause weird results, because these functions would then essentially be applied twice.
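A plain-Python sketch of the point (conceptual only; `nn.CrossEntropyLoss` is implemented differently but computes the same thing): since the loss already contains the log_softmax, feeding it softmax probabilities instead of raw logits yields a different, wrong number. The logits below are made up for illustration.

```python
import math

def log_softmax(xs):
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def cross_entropy(logits, target):
    # Conceptually what nn.CrossEntropyLoss does: log_softmax followed by NLL.
    return -log_softmax(logits)[target]

logits = [2.0, 0.5, -1.0]                            # hypothetical raw outputs
probs = [math.exp(v) for v in log_softmax(logits)]   # softmax probabilities

print(cross_entropy(logits, 0))   # intended loss
print(cross_entropy(probs, 0))    # softmax applied twice: a different loss
```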

It may sound super trivial and is mentioned in the documentation, but I am just pointing it out because I made that mistake some time ago when I started using PyTorch, and spent quite some time with gradient checking, figuring out what was going on.

That’s what I think might have gone wrong.

While there are good points in this thread, I’m worried about the validity of the overall method to measure the “performance” of these non-linearities. Thanks for clarifying this issue.

I’ve discovered a mystery of the softmax here.

Accidentally I had two logsoftmax - one in my network and one in my loss function (inside cross entropy). Since the logsoftmax of a logsoftmax gives the same result, the model was actually performing correctly with both; but when I switched to just softmax, it was messing up the numbers.
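That observation checks out numerically: log_softmax is idempotent (its outputs are already normalized log-probabilities, so the log-sum-exp term is zero on a second pass), whereas softmax outputs are not a fixed point. A small plain-Python sketch with made-up logits:

```python
import math

def log_softmax(xs):
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

logits = [2.0, -1.0, 0.5]   # hypothetical values
lp = log_softmax(logits)

# Applying log_softmax twice changes nothing: sum_k e^{lp_k} = 1, so lse = 0.
print(log_softmax(lp))      # same values as lp

# Softmax probabilities are not a fixed point, so the numbers get distorted.
sm = [math.exp(v) for v in lp]
print(log_softmax(sm))      # different values
```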

Thanks again to all for the explanations.

How I understand the difference between log_softmax and softmax is that, when you apply log, complex operations become simple, e.g. log(a/b) = log(a) - log(b) and so on.

As both are monotonic functions, applying log makes the computation of softmax easier, and when you apply exp to the output again you get back the real class values (the softmax values themselves).

Correct me if I am wrong.

Also, you’ll be numerically more stable as the log-sum-exp trick is applied.
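For reference, the log-sum-exp trick rests on this identity (with m chosen as the maximum, so the largest exponent becomes zero):

```
\log \sum_k e^{x_k}
  = \log \Bigl( e^{m} \sum_k e^{x_k - m} \Bigr)
  = m + \log \sum_k e^{x_k - m},
\qquad m = \max_k x_k
```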

Looks like for super small input, the `log_softmax` would fail anyway.

For example

```
In [125]: import numpy as np
In [126]: import torch as th; import torch.nn.functional as F
In [127]: z = th.DoubleTensor(np.array([1e-15, 2e-15, 3e-15]))
In [128]: F.log_softmax(z, dim=0)
Out[128]: tensor([-1.0986, -1.0986, -1.0986], dtype=torch.float64)
```

it returns the same value for the 3 input values.

I think this is a known limitation. While the log-sum-exp trick will save you from overflows (since the largest number would be zero), it won’t save you from underflow, if you are dealing with numbers close to zero.
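A tiny plain-Python illustration of that limitation: after the max shift the largest exponent is exp(0) = 1, but terms far enough below the max still underflow to exactly zero in double precision (the values are made up for the example).

```python
import math

shifted = [0.0, -800.0]       # hypothetical logits after subtracting the max
print(math.exp(shifted[0]))   # 1.0 -- safe, no overflow
print(math.exp(shifted[1]))   # 0.0 -- true underflow: exp(-800) is below the
                              #        smallest representable double
```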

Your input values, in their original form as well as after subtracting the max value, are still in the range `~1e-15` and would thus underflow to zero, if you apply `torch.exp` on them:

```
import torch

z = torch.tensor([1e-15, 2e-15, 3e-15], dtype=torch.float64)
z_max = torch.max(z)
print(torch.exp(z))
> tensor([1.0000, 1.0000, 1.0000], dtype=torch.float64)
print(torch.exp(z - z_max))
> tensor([1.0000, 1.0000, 1.0000], dtype=torch.float64)
```

Are you sure there’s an actual underflow problem there and not simply that the printout only shows four decimals?

Try printing `torch.exp(z) - 1` instead: `tensor([1.1102e-15, 1.9984e-15, 3.1086e-15], dtype=torch.float64)`

That looks right to me (with rounding errors due to floating point etc., of course), since e^x ≈ 1 + x as x → 0, and differences that small shouldn’t be visible in the log_softmax output.

Unless I’m completely misunderstanding something, in which case I’m looking forward to learning something new when you respond

Hi Daniel!

I think you are correct. I would call this *round-off error* (where, numerically, `(1.0 + delta) - 1.0` becomes exactly floating-point zero somewhere around `delta = 1.e-16` (for double precision)).

To me, *underflow* is where a very small `epsilon` becomes exactly floating-point zero somewhere around `epsilon = 1.e-324` (for double precision).

The problem is that for small `delta`, `exp (delta) ~ 1.0 + delta`, so you get exactly this kind of round-off error.

Note that many math libraries, including pytorch, implement the `expm1()` function to address this issue. (I don’t think this helps with `Softmax` or `LogSoftmax` though, because in this case you anyway end up with results of order 1.)
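On current Python, the same effect is visible directly with the standard-library `math.expm1`:

```python
import math

delta = 1e-15
# exp(delta) is rounded to a double near 1.0, so subtracting 1.0 keeps only
# a few significant digits of delta (round-off error)...
print(math.exp(delta) - 1.0)   # ~1.1102e-15

# ...while expm1 computes e^delta - 1 directly and keeps full precision.
print(math.expm1(delta))       # ~1.0000e-15
```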

This (0.3.0) script illustrates the round-off error issue and the `expm1()` function:

```
import torch
torch.__version__
import math

def expm1 (t): # not yet implemented in 0.3.0
    res = torch.zeros_like (t)
    for i in range (t.shape[0]):
        res[i] = math.expm1 (t[i]) # double precision, then truncated, if FloatTensor
    return res

z = torch.DoubleTensor ([1.e-15, 2.e-15, 3.e-15])
z_max = torch.max (z)
torch.set_printoptions (precision = 20)
expm1 (z)                             # correct to about 15 decimal digits
expm1 (z - z_max)                     # correct to about 15 decimal digits
expm1 (z.float())                     # not exactly single precision
expm1 (z.float() - z_max)             # not exactly single precision
torch.exp (z) - 1.0                   # double precision (without expm1)
torch.exp (z - z_max) - 1.0           # double precision (without expm1)
torch.exp (z.float()) - 1.0           # single precision (without expm1)
torch.exp (z.float() - z_max) - 1.0   # single precision (without expm1)
```

Here is the output:

```
>>> import torch
>>> torch.__version__
'0.3.0b0+591e73e'
>>>
>>> import math
>>>
>>> def expm1 (t): # not yet implemented in 0.3.0
...     res = torch.zeros_like (t)
...     for i in range (t.shape[0]):
...         res[i] = math.expm1 (t[i]) # double precision, then truncated, if FloatTensor
...     return res
...
>>> z = torch.DoubleTensor ([1.e-15, 2.e-15, 3.e-15])
>>>
>>> z_max = torch.max (z)
>>>
>>> torch.set_printoptions (precision = 20)
>>>
>>> expm1 (z) # correct to about 15 decimal digits
1.00000e-15 *
1.00000000000000066613
2.00000000000000177636
3.00000000000000444089
[torch.DoubleTensor of size 3]
>>> expm1 (z - z_max) # correct to about 15 decimal digits
1.00000e-15 *
-1.99999999999999755751
-0.99999999999999900080
0.00000000000000000000
[torch.DoubleTensor of size 3]
>>>
>>> expm1 (z.float()) # not exactly single precision
1.00000e-15 *
1.00000000362749363880
2.00000000725498727761
2.99999990500336233268
[torch.FloatTensor of size 3]
>>> expm1 (z.float() - z_max) # not exactly single precision
1.00000e-15 *
-1.99999979549675055424
-0.99999989774837527712
0.00000000000000000000
[torch.FloatTensor of size 3]
>>>
>>> torch.exp (z) - 1.0 # double precision (without expm1)
1.00000e-15 *
1.11022302462515654042
1.99840144432528155072
3.10862446895043786910
[torch.DoubleTensor of size 3]
>>> torch.exp (z - z_max) - 1.0 # double precision (without expm1)
1.00000e-15 *
-1.99840144432528155072
-0.99920072216264077536
0.00000000000000000000
[torch.DoubleTensor of size 3]
```

Best.

K. Frank

Thanks, very helpful!