Difference between Cross-Entropy Loss and Log-Likelihood Loss?

I’m very confused about the difference between cross-entropy loss and log-likelihood loss when dealing with multi-class classification (including binary classification) or multi-label classification.
Could you explain the difference?

Hello Doubt -

This may be a duplicate; see below*.

The cross-entropy loss and the (negative) log-likelihood are
the same in the following sense:

If you apply PyTorch’s CrossEntropyLoss to your output layer,
you get the same result as applying PyTorch’s NLLLoss to a
LogSoftmax layer added after your original output layer.

(I suspect – but don’t know for a fact – that using
CrossEntropyLoss will be more efficient because it
can collapse some calculations together, and doesn’t
introduce an additional layer.)
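
A quick numerical check of the equivalence described above. This is only a minimal sketch: the batch size, number of classes, and target values are made up for illustration, and the default `mean` reduction is assumed for both losses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical batch: 4 samples, 3 classes. These are raw output-layer
# values (logits); no softmax has been applied.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])  # integer class indices

# Route 1: CrossEntropyLoss applied directly to the raw outputs.
ce = nn.CrossEntropyLoss()(logits, targets)

# Route 2: an explicit LogSoftmax layer followed by NLLLoss.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

print(ce.item(), nll.item())  # the two values should agree up to float error
```

Note that NLLLoss expects log-probabilities, so feeding it raw outputs (or plain softmax probabilities) will run without error but give a different number.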

You are trying to maximize the “likelihood” of your model
parameters (weights) having the right values. Maximizing
the likelihood is the same as maximizing the log-likelihood,
which is the same as minimizing the negative-log-likelihood.
For the classification problem, the cross-entropy is the
negative-log-likelihood. (The “math” definition of cross-entropy
applies to your output layer being a (discrete) probability
distribution. PyTorch’s CrossEntropyLoss implicitly adds
a soft-max that “normalizes” your output layer into such a
probability distribution.)
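
Written out, the argument above looks roughly like this. The notation is my own assumption, not something from the thread: z is the raw output layer, p is its soft-max, y is the one-hot target whose true class is c, and the N samples are treated as independent.

```latex
% Assumed notation: z = raw outputs, p = softmax(z), y = one-hot target
% with true class c, i = sample index over N independent samples.
\begin{align*}
p_k &= \frac{e^{z_k}}{\sum_{j} e^{z_j}} \\
H(y, p) &= -\sum_{k} y_k \log p_k = -\log p_c \\
-\log \mathcal{L}(\theta)
  &= -\log \prod_{i=1}^{N} p^{(i)}_{c_i}
   = \sum_{i=1}^{N} -\log p^{(i)}_{c_i}
   = \sum_{i=1}^{N} H\big(y^{(i)}, p^{(i)}\big)
\end{align*}
```

Because y is one-hot, only the true-class term survives in the cross-entropy sum, which is why it collapses to the negative log of the probability assigned to the correct class. With its default settings PyTorch averages over the batch instead of summing, which rescales the loss but does not move the minimum.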

Wikipedia has some explanation of the equivalence of
negative-log-likelihood and cross-entropy.

*Possible duplicate:


K. Frank


I understand your explanation now.

Hi, I wrote a short post about KL divergence, cross-entropy and negative log-likelihood loss a few weeks ago: https://medium.com/@stepanulyanin/notes-on-deep-learning-theory-part-1-data-generating-process-31fdda2c8941. I hope you can find a few answers there too.


As pointed out above, negative log-likelihood and cross-entropy are conceptually the same. Cross-entropy is also a generalization of binary cross-entropy to multiple classes with one-hot encoded targets. The confusion mostly comes from the naming in PyTorch, where the differently named loss functions expect different input representations. So while the concept is the same, the implementation-level usage is something to memorize or at least be aware of. I made a quick cheatsheet for my students because of that: https://github.com/rasbt/stat479-deep-learning-ss19/blob/master/other/pytorch-lossfunc-cheatsheet.md
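
To make the "generalization" point concrete, here is a small sketch (not taken from the cheatsheet; the logit and label values are made up). It assumes the binary case is modeled with a single logit per sample and compares BCEWithLogitsLoss against CrossEntropyLoss on an equivalent two-class problem whose class-0 logit is fixed at zero, since softmax([0, z]) puts sigmoid(z) on class 1.

```python
import torch
import torch.nn as nn

# Hypothetical binary problem: one raw logit per sample.
z = torch.tensor([0.3, -1.2, 2.0])
y = torch.tensor([1, 0, 1])  # integer labels (int64)

# Binary cross-entropy: takes raw logits and float targets of the same shape.
bce = nn.BCEWithLogitsLoss()(z, y.float())

# Same problem posed as 2-class classification: shape (N, 2) logits with the
# class-0 logit set to zero, and integer class indices as targets.
two_class_logits = torch.stack([torch.zeros_like(z), z], dim=1)
ce = nn.CrossEntropyLoss()(two_class_logits, y)

print(bce.item(), ce.item())  # should agree up to float error
```

The difference in input representation shows up directly here: BCEWithLogitsLoss wants float targets shaped like the logits, while CrossEntropyLoss wants one column of logits per class and integer class indices.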


I understand this now :smile:

Thanks @K_Frank, @halahup and @rasbt.