What is the difference between log_softmax and softmax?

What is the difference between log_softmax and softmax?
How to explain them in mathematics?
Thank you!


log_softmax applies the logarithm after softmax.

softmax:

exp(x_i) / exp(x).sum()

log_softmax:

log( exp(x_i) / exp(x).sum() )
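The two formulas can be checked in plain Python. This is just a sketch using the `math` module to make the definitions concrete; in PyTorch the corresponding calls would be `F.softmax` and `F.log_softmax`:

```python
import math

def softmax(xs):
    # exp(x_i) / sum_j exp(x_j)
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(xs):
    # log(exp(x_i) / sum_j exp(x_j)) simplifies to x_i - log(sum_j exp(x_j))
    log_total = math.log(sum(math.exp(x) for x in xs))
    return [x - log_total for x in xs]

xs = [1.0, 2.0, 3.0]
print(softmax(xs))      # probabilities that sum to 1
print(log_softmax(xs))  # the (negative) logs of those probabilities
```

For well-scaled inputs like these, taking `math.log` of each softmax output gives the same numbers as `log_softmax` directly.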

log_softmax essentially does log(softmax(x)), but the practical implementation is different and more efficient while performing the same operation. You might want to have a look at http://pytorch.org/docs/master/nn.html?highlight=log_softmax#torch.nn.LogSoftmax and the source code.


Can you please link to that implementation?
Is it calculated as x_i - log( exp(x).sum() )?

The implementation is done in torch.nn.functional where the function is called from c code: http://pytorch.org/docs/master/_modules/torch/nn/functional.html#log_softmax.

Is there any way to see the c code?


@KaiyangZhou’s answer may have been correct once, but does not match the current documentation, which reads:

“While mathematically equivalent to log(softmax(x)), doing these two
operations separately is slower, and numerically unstable. This function
uses an alternative formulation to compute the output and gradient correctly.”

And unfortunately the linked-to source for log_softmax merely includes a call to another .log_softmax() method which is defined somewhere else, but I have been unable to find it, even after running `grep -r 'def log_softmax'` on the pytorch directory.

EDIT: Regarding the source, a similar post, “Understanding code organization: where is `log_softmax` really implemented?”, was answered by @ptrblck, who pointed to the source code here: https://github.com/pytorch/pytorch/blob/420b37f3c67950ed93cd8aa7a12e673fcfc5567b/aten/src/ATen/native/SoftMax.cpp#L146 …And yet all that does is call still other functions, log_softmax_lastdim_kernel() or host_softmax. Still trying to find where the actual implementation is, not just calls-to-calls-to-calls.

You are right. There are two more dispatches involved and eventually _vec_log_softmax_lastdim is called for the log_softmax with a non-scalar input.


Is taking F.softmax and then applying torch.log the same as F.log_softmax?

In theory these methods are equivalent; in practice F.log_softmax is numerically more stable, as it uses the log-sum-exp trick internally.
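Here is a sketch of why the two-step version breaks and the trick fixes it, in plain Python (the max-subtraction below is the standard log-sum-exp stabilization; the function names are mine, not PyTorch's):

```python
import math

def log_softmax_naive(xs):
    # log(softmax(x)) computed in two separate steps:
    # exp(x_i) overflows for large inputs like x_i = 1000
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def log_softmax_stable(xs):
    # log-sum-exp trick: subtract m = max(x) before exponentiating, so
    # log_softmax(x)_i = (x_i - m) - log(sum_j exp(x_j - m))
    # every exp argument is <= 0, so nothing can overflow
    m = max(xs)
    log_total = math.log(sum(math.exp(x - m) for x in xs))
    return [x - m - log_total for x in xs]

xs = [1000.0, 1001.0, 1002.0]
try:
    log_softmax_naive(xs)       # math.exp(1000.0) raises OverflowError
except OverflowError:
    print("naive log(softmax(x)) overflowed")
print(log_softmax_stable(xs))   # finite log-probabilities
```

Exponentiating the stable outputs recovers probabilities that sum to 1, which is exactly what the two-step float computation can no longer deliver at this scale.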