Difference between (nn.Linear + nn.CrossEntropyLoss) and (nn.LogSoftmax + nn.NLLLoss)

In multi-class classification, I sometimes see the following two implementations:

  • nn.Linear + nn.CrossEntropyLoss
  • nn.LogSoftmax + nn.NLLLoss

Are they both the same in terms of the following?

  • Both are softmax classifiers
  • Mathematically
  • Model training efficiency
  • Any other differences?

What are the trade-offs to consider?

If the intention is to do binary classification, what’s the most efficient way to output a probability?


Both approaches are the same.
In fact, nn.CrossEntropyLoss just uses nn.LogSoftmax() + nn.NLLLoss() internally.
Here is the line of code.
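To see the equivalence concretely, here is a minimal sketch (tensor names and shapes are illustrative) verifying that both pipelines produce the same loss for the same logits and targets:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    logits = torch.randn(4, 3)            # [batch_size, num_classes], raw scores
    targets = torch.tensor([0, 2, 1, 2])  # class indices

    # Variant 1: raw logits straight into nn.CrossEntropyLoss
    ce_loss = nn.CrossEntropyLoss()(logits, targets)

    # Variant 2: nn.LogSoftmax followed by nn.NLLLoss
    log_probs = nn.LogSoftmax(dim=1)(logits)
    nll_loss = nn.NLLLoss()(log_probs, targets)

    print(torch.allclose(ce_loss, nll_loss))  # True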

For binary classification, you could use nn.CrossEntropyLoss() with a logit output of shape [batch_size, 2] or nn.BCELoss() with nn.Sigmoid() in the last layer.
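As a rough sketch of those two options (layer sizes and names are just for illustration):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    features = torch.randn(4, 8)          # [batch_size, num_features]
    targets = torch.tensor([0, 1, 1, 0])  # two classes

    # Option 1: two logits per sample + nn.CrossEntropyLoss
    two_logit_head = nn.Linear(8, 2)
    loss_ce = nn.CrossEntropyLoss()(two_logit_head(features), targets)

    # Option 2: one output unit + sigmoid + nn.BCELoss
    # (nn.BCELoss expects probabilities and float targets of the same shape)
    one_logit_head = nn.Linear(8, 1)
    probs = torch.sigmoid(one_logit_head(features)).squeeze(1)
    loss_bce = nn.BCELoss()(probs, targets.float())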


nn.BCEWithLogitsLoss = one sigmoid layer + nn.BCELoss, fused into a single op (which solves the numerical-instability problem).
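A minimal sketch of that equivalence (values are illustrative): nn.BCEWithLogitsLoss takes raw logits and matches an explicit sigmoid + nn.BCELoss, while using the log-sum-exp trick internally for numerical stability:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    logits = torch.randn(4)
    targets = torch.tensor([0., 1., 1., 0.])

    # Fused, numerically stable version, applied to raw logits
    stable = nn.BCEWithLogitsLoss()(logits, targets)

    # Explicit sigmoid followed by nn.BCELoss
    naive = nn.BCELoss()(torch.sigmoid(logits), targets)

    print(torch.allclose(stable, naive))  # True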

Thanks for the quick response! I’ll need to think it through a bit more to convince myself of it.

When you said binary classification, you meant just two categories (like spam or not spam), right?

Yes, by binary classification I meant a use case with two target classes (positive vs. negative).


Here’s the permalink to the line of code @ptrblck was pointing to in his answer (which pointed to a line in master, a moving target :wink:)

Oops, thanks for the permalink :wink: