# LogSoftmax vs Softmax

Hi Daniel!

I think you are correct. I would call this round-off error (where,
numerically, `(1.0 + delta) - 1.0` becomes exactly floating-point
zero somewhere around `delta = 1.e-16` (for double precision)).

To me, underflow is where a very small `epsilon` becomes exactly
floating-point zero somewhere around `epsilon = 1.e-324` (for
double precision).
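
(For concreteness, both effects are easy to see with plain python floats, which are
double precision. This is just an illustrative sketch:)

```
delta = 1.e-16
print ((1.0 + delta) - 1.0)   # 0.0 -- delta is lost to round-off when added to 1.0

epsilon = 1.e-324
print (epsilon == 0.0)        # True -- epsilon underflows to exactly zero
print (5.e-324 == 0.0)        # False -- the smallest subnormal double is about 5.e-324
```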

The problem is that for small `delta`, `exp (delta) ~ 1.0 + delta`,
so you get exactly this kind of round-off error.

Note that many math libraries, including pytorch, implement the
`expm1()` function to address this issue.
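
(In current pytorch versions `torch.expm1()` is built in; here is a quick,
illustrative sketch of the difference it makes:)

```
import math
import torch

delta = 1.e-16
print (math.exp (delta) - 1.0)   # 0.0 -- exp (delta) rounds to exactly 1.0, so the subtraction loses everything
print (math.expm1 (delta))       # 1e-16 -- expm1 keeps full precision for small arguments
print (torch.expm1 (torch.tensor ([delta], dtype = torch.float64)))   # tensor([1.0000e-16], dtype=torch.float64)
```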

(I don’t think this helps with `Softmax` or `LogSoftmax`, though, because
in this case you end up with results of order 1 anyway.)
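
(The numerical issue that does matter for `LogSoftmax` is a different one:
computing `log (softmax (x))` in two separate steps can underflow to zero and
then give `-inf`, whereas `log_softmax()` does the whole computation in one
shifted step. A small sketch with current pytorch:)

```
import torch

x = torch.tensor ([0.0, -1000.0])                # logits with a large gap

print (torch.log (torch.softmax (x, dim = 0)))   # tensor([0., -inf]) -- softmax underflows to 0.0, then log() blows up
print (torch.log_softmax (x, dim = 0))           # tensor([0., -1000.]) -- stays finite
```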

This (0.3.0) script illustrates the round-off error issue and the `expm1()`
function:

```
import torch
torch.__version__

import math

def expm1 (t):   # not yet implemented in 0.3.0
    res = torch.zeros_like (t)
    for i in range (t.shape[0]):
        res[i] = math.expm1 (t[i])   # double precision, then truncated, if FloatTensor
    return res

z = torch.DoubleTensor ([1.e-15, 2.e-15, 3.e-15])

z_max = torch.max (z)

torch.set_printoptions (precision = 20)

expm1 (z)                             # correct to about 15 decimal digits
expm1 (z - z_max)                     # correct to about 15 decimal digits

expm1 (z.float())                     # not exactly single precision
expm1 (z.float() - z_max)             # not exactly single precision

torch.exp (z) - 1.0                   # double precision (without expm1)
torch.exp (z - z_max) - 1.0           # double precision (without expm1)

torch.exp (z.float()) - 1.0           # single precision (without expm1)
torch.exp (z.float() - z_max) - 1.0   # single precision (without expm1)
```

Here is the output:

```
>>> import torch
>>> torch.__version__
'0.3.0b0+591e73e'
>>>
>>> import math
>>>
>>> def expm1 (t):   # not yet implemented in 0.3.0
...     res  = torch.zeros_like (t)
...     for i in range (t.shape[0]):
...         res[i] = math.expm1 (t[i])   # double precision, then truncated, if FloatTensor
...     return res
...
>>> z = torch.DoubleTensor ([1.e-15, 2.e-15, 3.e-15])
>>>
>>> z_max = torch.max (z)
>>>
>>> torch.set_printoptions (precision = 20)
>>>
>>> expm1 (z)                             # correct to about 15 decimal digits

1.00000e-15 *
1.00000000000000066613
2.00000000000000177636
3.00000000000000444089
[torch.DoubleTensor of size 3]

>>> expm1 (z - z_max)                     # correct to about 15 decimal digits

1.00000e-15 *
-1.99999999999999755751
-0.99999999999999900080
0.00000000000000000000
[torch.DoubleTensor of size 3]

>>>
>>> expm1 (z.float())                     # not exactly single precision

1.00000e-15 *
1.00000000362749363880
2.00000000725498727761
2.99999990500336233268
[torch.FloatTensor of size 3]

>>> expm1 (z.float() - z_max)             # not exactly single precision

1.00000e-15 *
-1.99999979549675055424
-0.99999989774837527712
0.00000000000000000000
[torch.FloatTensor of size 3]

>>>
>>> torch.exp (z) - 1.0                   # double precision (without expm1)

1.00000e-15 *
1.11022302462515654042
1.99840144432528155072
3.10862446895043786910
[torch.DoubleTensor of size 3]

>>> torch.exp (z - z_max) - 1.0           # double precision (without expm1)

1.00000e-15 *
-1.99840144432528155072
-0.99920072216264077536
0.00000000000000000000
[torch.DoubleTensor of size 3]
```

Best.

K. Frank

Why, then, does the PyTorch documentation show an example like this:

```
>>> # Example of target with class indices
>>> loss = nn.CrossEntropyLoss()
>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.empty(3, dtype=torch.long).random_(5)
>>> output = loss(input, target)
>>> output.backward()
>>>
>>> # Example of target with class probabilities
>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.randn(3, 5).softmax(dim=1)
>>> output = loss(input, target)
>>> output.backward()
```

It is very confusing !!
Also see https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html:

> This function doesn’t work directly with NLLLoss, which expects the Log to be computed between the Softmax and itself. Use log_softmax instead (it’s faster and has better numerical properties).

Why does the CrossEntropyLoss example show softmax as an alternative ??

`CrossEntropyLoss(x) = NLLLoss(LogSoftmax(x))`
`LogSoftmax(x) = Log(Softmax(x))`
`Softmax(x) != LogSoftmax(x)`
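
(For what it's worth, these identities are straightforward to check numerically
with integer class targets and default settings; a small sketch:)

```
import torch
import torch.nn as nn

torch.manual_seed(0)

input = torch.randn(3, 5)
target = torch.empty(3, dtype=torch.long).random_(5)

ce = nn.CrossEntropyLoss()(input, target)
nll_log_softmax = nn.NLLLoss()(torch.log_softmax(input, dim=1), target)
nll_softmax = nn.NLLLoss()(torch.softmax(input, dim=1), target)

print(torch.allclose(ce, nll_log_softmax))   # True  -- CrossEntropyLoss == NLLLoss(LogSoftmax(x))
print(torch.allclose(ce, nll_softmax))       # False -- plain Softmax is not a substitute
```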

Why is there such inconsistency in the documentation ?
The examples for CrossEntropyLoss should be fixed !!

Hi Denis!

At issue is that some new functionality has been added to pytorch’s
`CrossEntropyLoss` as of pytorch version 1.10.

Compare the documentation for `CrossEntropyLoss` in versions 1.9 and 1.10.

As of version 1.10, `CrossEntropyLoss` accepts probabilistic `target`s (that
are floating-point numbers), in addition to integer-class-label `target`s.

The sole purpose of `softmax()` in this example is to generate a `target`
that is a legitimate probability distribution across the class dimension. Note
that `softmax()` is not being applied to `input`.
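
(If it helps, here is a small sketch, using the default "mean" reduction and no
class weights or label smoothing, of what `CrossEntropyLoss` computes for such
a probabilistic `target`:)

```
import torch
import torch.nn as nn

torch.manual_seed (0)

loss = nn.CrossEntropyLoss ()

input = torch.randn (3, 5, requires_grad = True)
target = torch.randn (3, 5).softmax (dim = 1)   # rows are valid probability distributions

print (target.sum (dim = 1))                    # each row sums to (numerically) one

output = loss (input, target)

# matches the explicit per-sample cross entropy, averaged over the batch
manual = -(target * torch.log_softmax (input, dim = 1)).sum (dim = 1).mean ()
print (torch.allclose (output, manual))         # True
```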

Best.

K. Frank
