Difference between (nn.Linear + nn.CrossEntropyLoss) and (nn.LogSoftmax + nn.NLLLoss)

In multi-class classification, I sometimes see the following two implementations:

  • nn.Linear + nn.CrossEntropyLoss
  • nn.LogSoftmax + nn.NLLLoss

Are they both the same in terms of the following?

  • Both are softmax classifiers
  • Mathematically
  • Model training efficiency
  • Any other differences?

What are the trade-offs to consider?

If the intention is to do binary classification, what’s the most efficient way to output a probability?


Both approaches are the same.
In fact, nn.CrossEntropyLoss just applies nn.LogSoftmax() + nn.NLLLoss() internally.
Here is the line of code.
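To verify the equivalence numerically, here is a small sketch (the tensor values are arbitrary) comparing the two paths on the same logits and targets:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)            # [batch_size, num_classes] raw scores
targets = torch.tensor([0, 2, 1, 2])  # class indices

# Path 1: CrossEntropyLoss applied directly to the logits
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# Path 2: LogSoftmax followed by NLLLoss
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_nll = nn.NLLLoss()(log_probs, targets)

print(torch.allclose(loss_ce, loss_nll))  # True: both compute the same loss
```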

For a binary classification you could use nn.CrossEntropyLoss() with a logit output of shape [batch_size, 2] or nn.BCELoss() with a nn.Sigmoid() in the last layer.


BCEWithLogitsLoss = one sigmoid layer + BCELoss, combined in a single numerically stable operation (it avoids the instability of applying the sigmoid and the log separately).
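A quick check of that equivalence (values are arbitrary; for moderate logits the two paths agree, while BCEWithLogitsLoss stays stable for extreme logits where sigmoid saturates):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4)
targets = torch.tensor([0., 1., 1., 0.])

# Fused, numerically stable version
loss_stable = nn.BCEWithLogitsLoss()(logits, targets)

# Separate sigmoid + BCELoss
loss_separate = nn.BCELoss()(torch.sigmoid(logits), targets)

print(torch.allclose(loss_stable, loss_separate))  # True for moderate logits
```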

Thanks for the quick response! With this, I guess I’ll need to think through it a bit harder to convince myself of it.

When you say binary classification, do you mean just two categories (like spam or not spam)?

Yes, by binary classification I meant a use case with two target classes (positive vs. negative).


Here’s the permalink to the line of code @ptrblck was pointing to in his answer (which pointed to a line in master - a moving target :wink: )

Oops, thanks for the permalink :wink: