# The effects of Double Logarithms (Log Cross Entropy Loss) + Overfitting

Hello everyone,

My neural network does pixel classification. It is trained with two losses: one is a binary cross entropy, and the other is a multi-label cross entropy.

The yellow graphs are the ones with the double logarithm, meaning we take log(sum(ce_loss)). The red-pink graphs are the ones with just sum(ce_loss).
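To make the two variants concrete, here is a minimal sketch of what the comparison looks like in PyTorch. The tensor shapes and names are illustrative assumptions, not the original code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy per-pixel logits and binary targets (shapes are illustrative).
logits = torch.randn(4, 1, 8, 8, requires_grad=True)
targets = torch.randint(0, 2, (4, 1, 8, 8)).float()

# Variant 1: per-pixel binary cross entropy, summed over all pixels.
ce_sum = F.binary_cross_entropy_with_logits(logits, targets, reduction="sum")

# Variant 2: "double logarithm" -- the log of the summed CE loss.
double_log_loss = torch.log(ce_sum)

print(ce_sum.item(), double_log_loss.item())
```

Either scalar can then be passed to `.backward()` as the training loss; the thread below discusses how the two differ.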

The dashed lines represent the validation step; the solid lines represent the training step.

The top yellow and top red-pink figures both represent the count of 1s. Both are supposed to converge to 30. It is clear that the top yellow figure demonstrates that both training and validation converge to 30, while the bottom red-pink figure demonstrates that only training converges to 30 and validation converges to 0!

My questions are:

1. I have not been able to find any literature in which anyone has used a double logarithm (referencing the yellow graphs). But the results are clearly much better than with just a typical cross entropy loss (referencing the red graphs).

Does anyone know why adding a logarithm on top of a cross entropy would improve the results? My original purpose in adding the outer logarithm to the CE loss was to increase numerical stability, and it does seem to serve that purpose.

2. The yellow graphs have high training and validation accuracy (due to the expected count of thirty 1s), though the validation loss (middle graph) is increasing. Is this a case of overfitting?

3. The red graphs have high training accuracy yet poor validation accuracy. The validation losses (middle and right graphs) are increasing for both losses. Is this also a case of overfitting?

My colleague questions the usage of the double logarithm, but it clearly seems that the double logarithm performs better than without the second (outer) logarithm.

Any advice and suggestions would be great! Thank you.

I’m not sure what you are trying to achieve by taking the log of the CE loss.

With regular training you are minimizing the CE loss; mean or sum doesn’t matter that much, because in the end the increased scale will be compensated by smaller model weights.
But by applying the log, your new minimum will be achieved when the (mean) CE loss equals 1/N, i.e. the sum equals 1, because log(1) = 0.

Hi Sergey!

To avoid any confusion, let’s be clear that this is not correct.

It is true that many loss functions have zero as their minimum, but
gradient descent and the PyTorch optimizers don’t care whether this
is true or not; they simply push the loss function to smaller – and
potentially negative – values.

`log()` is a monotone (strictly increasing) function, so `log (ce_loss)` achieves
its minimum exactly where `ce_loss` achieves its minimum. `ce_loss`
ranges from `0.0` to `inf` and, correspondingly, `log (ce_loss)` ranges
from `-inf` to `inf`. When your predictions are “perfect,” `ce_loss` is `0.0`,
its minimum, and `log (ce_loss)` is `-inf`, its minimum.

Whether you train with `ce_loss` or `log (ce_loss)`, the optimization
drives both towards their corresponding minima, namely `0.0` for `ce_loss`
and `-inf` for `log (ce_loss)`. Nothing special happens when `ce_loss`
becomes `1/N` (or `1` or any other value other than zero), and the fact that
`log (ce_loss)` happens to be `0.0` when `ce_loss = 1/N` is irrelevant;
the optimization simply continues to drive `log (ce_loss)` to algebraically
smaller values, and thus to negative values on down towards `-inf`.

(Consider what would happen if one were to use `log (ce_loss) + 17.2`
or `log (ce_loss) - 105.9` as the loss function for training. Would the
gradients change? Would the value `17.2` or `105.9` have any effect on the
result of training?)
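The thought experiment above can be checked directly with autograd. This is a small sketch (the constant `17.2` is the one from the parenthetical; tensor shapes are illustrative) showing that an additive constant leaves the gradients unchanged:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)
targets = torch.randint(0, 2, (5,)).float()

def grad_of(offset):
    """Gradient of log(ce_loss) + offset with respect to the logits."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="sum")
    loss = torch.log(ce) + offset
    return torch.autograd.grad(loss, logits)[0]

g_plain = grad_of(0.0)
g_shift = grad_of(17.2)

# The derivative of a constant is zero, so the gradients are identical.
print(torch.allclose(g_plain, g_shift))
```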

Best.

K. Frank

Right, thanks for catching this.
I mentally flipped the gradient direction, as if there were an abs() in there.

I think that by using the second log you are only changing the effective learning rate; i.e., if the summed CE is around `10`, then your effective learning rate is `0.1 * lr`.