What is the difference between log_softmax and softmax?

How are they expressed mathematically?

Thank you!

log_softmax applies logarithm after softmax.

softmax:

```
exp(x_i) / exp(x).sum()
```

log_softmax:

```
log( exp(x_i) / exp(x).sum() )
```

log_softmax essentially computes log(softmax(x)), but the practical implementation is different and more efficient while producing the same result. You might want to have a look at http://pytorch.org/docs/master/nn.html?highlight=log_softmax#torch.nn.LogSoftmax and the source code.
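The two formulas above can be sketched in plain Python (for illustration only; this is not PyTorch's actual implementation, which operates on tensors):

```python
import math

def softmax(x):
    # exp(x_i) / exp(x).sum()
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(x):
    # log( exp(x_i) / exp(x).sum() )
    return [math.log(p) for p in softmax(x)]

x = [1.0, 2.0, 3.0]
print(softmax(x))      # probabilities that sum to 1
print(log_softmax(x))  # their logarithms, all negative
```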

Can you please link that implementation?

Is it calculated as `x_i - log( exp(x).sum() )`?
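Algebraically that identity holds, since log(exp(x_i) / exp(x).sum()) = x_i - log(exp(x).sum()). A quick numerical check in plain Python (the function names here are made up for illustration, not PyTorch internals):

```python
import math

def log_softmax_naive(x):
    # log(softmax(x)) computed directly from the definition
    total = sum(math.exp(v) for v in x)
    return [math.log(math.exp(v) / total) for v in x]

def log_softmax_identity(x):
    # the algebraically equivalent form x_i - log(exp(x).sum())
    log_total = math.log(sum(math.exp(v) for v in x))
    return [v - log_total for v in x]

x = [0.5, 1.5, -2.0]
assert all(abs(a - b) < 1e-9
           for a, b in zip(log_softmax_naive(x), log_softmax_identity(x)))
```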

The implementation is in `torch.nn.functional`, where the function calls into C code: http://pytorch.org/docs/master/_modules/torch/nn/functional.html#log_softmax.

Is there any way to see the C code?

@KaiyangZhou’s answer may have been correct once, but does not match the current documentation, which reads:

“While mathematically equivalent to log(softmax(x)), doing these two operations separately is slower, and numerically unstable. This function uses an alternative formulation to compute the output and gradient correctly.”

And unfortunately the linked-to source for `log_softmax` merely includes a call to another `.log_softmax()` method which is defined somewhere else, and I have been unable to find it, even after running `grep -r 'def log_softmax *` on the PyTorch directory.

**EDIT:** Regarding the source, a similar post, “Understanding code organization: where is `log_softmax` really implemented?”, was answered by @ptrblck as pointing to the source code here: https://github.com/pytorch/pytorch/blob/420b37f3c67950ed93cd8aa7a12e673fcfc5567b/aten/src/ATen/native/SoftMax.cpp#L146 …And yet all that does is call still other functions, `log_softmax_lastdim_kernel()` or `host_softmax`. Still trying to find the actual implementation, not just calls-to-calls-to-calls.

You are right. There are two more dispatches involved, and eventually `_vec_log_softmax_lastdim` is called for `log_softmax` with a non-scalar input.

Hi,

Is taking F.softmax and then applying torch.log the same as F.log_softmax?

In theory these methods are equal; in practice `F.log_softmax` is numerically more stable, as it uses the log-sum-exp trick internally.
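The log-sum-exp trick subtracts max(x) before exponentiating, so the largest exponent is exp(0) = 1 and nothing overflows. A minimal pure-Python sketch (again illustrative, not the PyTorch code):

```python
import math

def log_softmax_stable(x):
    # log-sum-exp trick: shift by max(x) so exponents stay in a safe range
    m = max(x)
    log_total = m + math.log(sum(math.exp(v - m) for v in x))
    return [v - log_total for v in x]

x = [1000.0, 1001.0, 1002.0]
# Computing log(softmax(x)) directly would fail here:
# math.exp(1000.0) raises OverflowError.
print(log_softmax_stable(x))  # finite values
```

Exponentiating the result recovers a valid probability distribution, which is a quick sanity check that the shifted formula is still correct.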