Does cross entropy loss implicitly apply log_softmax?

Hi folks,
I’m a bit confused about the proper usage of cross entropy loss and log_softmax.
I’ve read somewhere that nn.CrossEntropyLoss() implicitly applies nn.LogSoftmax to the output of your net. Is that true?
In that case, is the implementation here wrong?

I’ve also read that if you want to be more verbose you could use nn.NLLLoss() together with F.log_softmax() (from torch.nn.functional). Is that true? In some experiments with small MLPs, that combination didn’t yield results as good as simply using nn.CrossEntropyLoss(). Are there any other intrinsic differences we should be aware of?

Thanks!

The implementation does indeed look wrong, as the code seems to combine nn.LogSoftmax with nn.CrossEntropyLoss.
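For reference, a minimal sketch of the matching combinations (the model and shapes are made up for illustration): a model that already ends in nn.LogSoftmax should be paired with nn.NLLLoss, while a model returning raw logits should be paired with nn.CrossEntropyLoss.

import torch
import torch.nn as nn

# model already applies log_softmax, so use NLLLoss,
# not CrossEntropyLoss (which would apply log_softmax again)
model = nn.Sequential(nn.Linear(20, 5), nn.LogSoftmax(dim=1))
criterion = nn.NLLLoss()

x = torch.randn(8, 20)
target = torch.randint(0, 5, (8,))
loss = criterion(model(x), target)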

Yes, you can see it in this line of code.

Both approaches should yield the same results.
The only possible mistake I can think of is specifying the wrong dim in F.log_softmax.
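You can check the equivalence numerically with a quick sketch (random logits and targets, just for illustration):

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(8, 5)            # raw network outputs
target = torch.randint(0, 5, (8,))

loss_ce = nn.CrossEntropyLoss()(logits, target)
loss_nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)
print(torch.allclose(loss_ce, loss_nll))  # True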

Thank you, that cleared a lot of things up. The dim I was using in F.log_softmax was -1.
If you don’t mind me asking another stupid question: I’ve noticed that torch.topk() with k=1 works like torch.max() on a 2-dim input x over dim=1. Am I correct in assuming that? When I first saw torch.topk() with k=1 it confused me a lot, as I was expecting just one value to be returned, i.e. the single top result.

Yes, both methods will return the same outputs:

import torch

x = torch.randn(10, 5)
# both return a (values, indices) pair of shape [10, 1]
print(torch.max(x, dim=1, keepdim=True))
print(torch.topk(x, 1, dim=1))
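If you want to verify it programmatically rather than by eye, a short check (using the same random x as above) could look like this:

import torch

x = torch.randn(10, 5)
values_max, indices_max = torch.max(x, dim=1, keepdim=True)
values_topk, indices_topk = torch.topk(x, 1, dim=1)
print(torch.equal(values_max, values_topk))    # True
print(torch.equal(indices_max, indices_topk))  # True (barring exact ties)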

Thanks a lot, that makes sense. Although, thinking about it, with k=1 you would intuitively expect topk to return just the top k results, as the function name suggests, and in this case that would have been a single value.