LogSoftmax vs Softmax

Hi Daniel!

I think you are correct. I would call this round-off error (where,
numerically, (1.0 + delta) - 1.0 becomes exactly floating-point
zero somewhere around delta = 1.e-16 (for double precision)).

To me, underflow is where a very small epsilon becomes exactly
floating-point zero somewhere around epsilon = 1.e-324 (for
double precision).
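For concreteness, here is a small plain-Python sketch of the two effects (the thresholds are the standard IEEE-754 double-precision ones):

delta = 1.e-16
(1.0 + delta) - 1.0   # 0.0 -- round-off: 1.e-16 is below half of the machine epsilon (~2.2e-16)

eps = 5.e-324         # smallest positive (subnormal) double, ~4.9e-324
eps == 0.0            # False -- not yet underflowed
eps / 2 == 0.0        # True -- underflow to exact floating-point zero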

The problem is that for small delta, exp (delta) ~ 1.0 + delta,
so computing exp (delta) - 1.0 directly runs into exactly this kind of
round-off error.

Note that many math libraries, including pytorch, implement the
expm1() function to address this issue.

(I don’t think this helps with Softmax or LogSoftmax, though, because
in that case you end up with results of order 1 anyway.)
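
As an aside, recent pytorch versions provide torch.expm1() directly (the
0.3.0 script below rolls its own); a quick sketch:

import torch

x = torch.tensor ([1.e-15, 2.e-15, 3.e-15], dtype = torch.float64)
torch.expm1 (x)       # accurate: approximately [1.e-15, 2.e-15, 3.e-15]
torch.exp (x) - 1.0   # round-off error already visible in the result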

This (0.3.0) script illustrates the round-off error issue and the expm1()
function:

import torch
torch.__version__

import math

def expm1 (t):   # not yet implemented in 0.3.0
    res  = torch.zeros_like (t)
    for  i in range (t.shape[0]):
        res[i] = math.expm1 (t[i])   # double precision, then truncated, if FloatTensor
    return res

z = torch.DoubleTensor ([1.e-15, 2.e-15, 3.e-15])

z_max = torch.max (z)

torch.set_printoptions (precision = 20)

expm1 (z)                             # correct to about 15 decimal digits
expm1 (z - z_max)                     # correct to about 15 decimal digits

expm1 (z.float())                     # not exactly single precision
expm1 (z.float() - z_max)             # not exactly single precision

torch.exp (z) - 1.0                   # double precision (without expm1)
torch.exp (z - z_max) - 1.0           # double precision (without expm1)

torch.exp (z.float()) - 1.0           # single precision (without expm1)
torch.exp (z.float() - z_max) - 1.0   # single precision (without expm1)

Here is the output:

>>> import torch
>>> torch.__version__
'0.3.0b0+591e73e'
>>>
>>> import math
>>>
>>> def expm1 (t):   # not yet implemented in 0.3.0
...     res  = torch.zeros_like (t)
...     for  i in range (t.shape[0]):
...         res[i] = math.expm1 (t[i])   # double precision, then truncated, if FloatTensor
...     return res
...
>>> z = torch.DoubleTensor ([1.e-15, 2.e-15, 3.e-15])
>>>
>>> z_max = torch.max (z)
>>>
>>> torch.set_printoptions (precision = 20)
>>>
>>> expm1 (z)                             # correct to about 15 decimal digits

1.00000e-15 *
 1.00000000000000066613
 2.00000000000000177636
 3.00000000000000444089
[torch.DoubleTensor of size 3]

>>> expm1 (z - z_max)                     # correct to about 15 decimal digits

1.00000e-15 *
 -1.99999999999999755751
 -0.99999999999999900080
 0.00000000000000000000
[torch.DoubleTensor of size 3]

>>>
>>> expm1 (z.float())                     # not exactly single precision

1.00000e-15 *
 1.00000000362749363880
 2.00000000725498727761
 2.99999990500336233268
[torch.FloatTensor of size 3]

>>> expm1 (z.float() - z_max)             # not exactly single precision

1.00000e-15 *
 -1.99999979549675055424
 -0.99999989774837527712
 0.00000000000000000000
[torch.FloatTensor of size 3]

>>>
>>> torch.exp (z) - 1.0                   # double precision (without expm1)

1.00000e-15 *
 1.11022302462515654042
 1.99840144432528155072
 3.10862446895043786910
[torch.DoubleTensor of size 3]

>>> torch.exp (z - z_max) - 1.0           # double precision (without expm1)

1.00000e-15 *
 -1.99840144432528155072
 -0.99920072216264077536
 0.00000000000000000000
[torch.DoubleTensor of size 3]

Best.

K. Frank

Thanks, very helpful!

Why, then, does the PyTorch documentation show an example like this:

>>> # Example of target with class indices
>>> loss = nn.CrossEntropyLoss()
>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.empty(3, dtype=torch.long).random_(5)
>>> output = loss(input, target)
>>> output.backward()
>>>
>>> # Example of target with class probabilities
>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.randn(3, 5).softmax(dim=1)
>>> output = loss(input, target)
>>> output.backward()

This is very confusing!
Also see https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html:

This function doesn’t work directly with NLLLoss, which expects the Log to be computed between the Softmax and itself. Use log_softmax instead (it’s faster and has better numerical properties).

Why does the CrossEntropyLoss example show softmax() as an alternative?

CrossEntropyLoss(x) = NLLLoss(LogSoftmax(x))
LogSoftmax(x) = Log(Softmax(x))
Softmax(x) != LogSoftmax(x)
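
(For reference, the relationship itself is easy to check numerically; a small
sketch using standard torch.nn.functional calls:)

import torch
import torch.nn.functional as F

torch.manual_seed (0)
input = torch.randn (3, 5)
target = torch.randint (5, (3,))

ce  = F.cross_entropy (input, target)
nll = F.nll_loss (F.log_softmax (input, dim = 1), target)
torch.allclose (ce, nll)                          # True

# but softmax() by itself is not log_softmax():
torch.allclose (F.softmax (input, dim = 1),
                F.log_softmax (input, dim = 1))   # False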

Why is there such inconsistency in the documentation?
The examples for CrossEntropyLoss should be fixed!

Hi Denis!

At issue is that some new functionality has been added to pytorch’s
CrossEntropyLoss as of pytorch version 1.10.

Compare the documentation for CrossEntropyLoss in versions 1.9 and 1.10.

As of version 1.10, CrossEntropyLoss accepts probabilistic targets (that
are floating-point numbers), in addition to integer-class-label targets.

The sole purpose of softmax() in this example is to generate a target
that is a legitimate probability distribution across the class dimension. Note
that softmax() is not being applied to input.
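
A small sketch of what that means (assuming pytorch >= 1.10): a one-hot
probabilistic target gives the same loss as the corresponding integer class
label, so softmax() in the docs example is only manufacturing a valid
probability distribution for the target:

import torch
import torch.nn.functional as F

torch.manual_seed (0)
input = torch.randn (3, 5)
target_idx = torch.randint (5, (3,))

# probabilistic target: one-hot rows (a valid probability distribution per row)
target_prob = F.one_hot (target_idx, num_classes = 5).float()

loss_idx  = F.cross_entropy (input, target_idx)    # integer class labels
loss_prob = F.cross_entropy (input, target_prob)   # probabilistic targets (1.10+)
torch.allclose (loss_idx, loss_prob)               # True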

Best.

K. Frank


I just ran into this thread and can confirm that having an nn.Softmax() activation layer and passing its output to log_softmax() or CrossEntropyLoss() is really bad. This was on the DermaMNIST dataset with a basic CNN+FCN network (interview question).
With the nn.Softmax() activation layer, training is stuck at 0.67 train accuracy:
[22/100] train_loss: 1.499 - train_acc: 0.670 - eval_loss: 1.467 - eval_acc: 0.669

With no nn.Softmax() layer, just passing logits:
[22/100] train_loss: 0.129 - train_acc: 0.956 - eval_loss: 1.831 - eval_acc: 0.699
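
For anyone hitting the same wall, the fix is just to drop the final nn.Softmax()
and feed raw logits to CrossEntropyLoss. A minimal toy sketch (made-up layer
sizes, not the actual DermaMNIST model):

import torch
import torch.nn as nn

# problematic: an extra Softmax before CrossEntropyLoss, which already
# applies log_softmax internally -- probabilities are fed where logits are expected
bad_model  = nn.Sequential (nn.Linear (10, 7), nn.Softmax (dim = 1))

# preferred: output raw logits and let CrossEntropyLoss do the rest
good_model = nn.Sequential (nn.Linear (10, 7))

criterion = nn.CrossEntropyLoss()
x = torch.randn (8, 10)
y = torch.randint (7, (8,))

loss_bad  = criterion (bad_model (x), y)    # loss computed on double-squashed scores
loss_good = criterion (good_model (x), y)   # loss computed on logits, as intended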

Hey, what are the input and output/target dimensions for the LSTM in this discussion (batch size, sequence length, feature vector)? I'm curious whether the sequence is too long. Are you tracking gradient and loss values over epochs? Is the feature vector sparse, and is it bool, long, or float? I am experimenting with all of these variables for RNN/LSTM/GRU models with various types of loss functions and architectures. Please share which of the loss functions you tested here works better and under what conditions.

How does passing softmax activation output to log_softmax work? I'm just trying to understand the objective function. If softmax is f, then log_softmax is log(f). Passing softmax output to log_softmax would give log(f(f(x))). Is f(f(x)) tractable and differentiable? Passing softmax output to cross_entropy is quite similar to passing the softmax activation to log_softmax, but with a sum over the target vector.
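
A quick numeric sketch of that composition (f(f(x)) is certainly differentiable,
being a composition of smooth functions): the inner softmax squashes the logits
into [0, 1], so the outer log_softmax sees nearly uniform inputs and produces
flattened log-probabilities (toy numbers, for illustration only):

import torch
import torch.nn.functional as F

logits = torch.tensor ([[4.0, 1.0, -2.0]])

correct = F.log_softmax (logits, dim = 1)                       # log (f (x))
double  = F.log_softmax (F.softmax (logits, dim = 1), dim = 1)  # log (f (f (x)))

print (correct)   # roughly [-0.05, -3.05, -6.05] -- clearly peaked
print (double)    # roughly [-0.58, -1.49, -1.53] -- much flatter, weak gradients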