My neural network involves pixel classifications. It involves two losses: one is a binary cross entropy, and the other is a multi-label cross entropy.

The yellow graphs are the ones with double logarithm, meaning that we log(sum(ce_loss)). The red pink graphs are the ones with just sum(ce_loss).

The dash lines represent validation step. The solid lines represent training step.

The top yellow and top red-pink figures both represent the count of 1s. Both are supposed to converge to 30. It is clear that the top yellow figure demonstrate that both training and validation converge to 30. While the bottom red-pink figure demonstrate that only training converges to 30, while validation converges to 0… !

My questions are:

I have not been able to find any literature whatsoever that anyone ever used a double logarithm (referencing yellow graphs). But the results are clearly much better than just a typical cross entropy loss (referencing red graphs).

Does anyone know why adding a logarithm on a cross entropy would improve the results? My original purpose of adding the outer logarithm on a CE loss was to increase computational stability. And it does seems to serve my purpose.

The yellow graphs have high training and validation accuracy (due to expected counts of thirty 1s), though the validation loss (middle graph) is increasing. Is this a case of overfitting?

The red graphs have high training accuracy yet poor validation accuracy. The validation loss (middle and right graph) for both losses are increasing. Is this also a case of overfitting?

My colleague questioning the usage of double logarithm, but clearly it seems that double logarithm is performing better than without the second(outer) logarithm.

Any advice and suggestions would be great! Thank you.

I’m not sure what are you trying to achieve by log over ce loss

With regular training you are minimizing ce loss, mean or sum doesn’t matter that much, bcs in the end increased scale will be compensated by smaller model weights.
But by applying log, your new minimum will be achieved if ce loss = 1/N, bcs log(1)=0

To avoid any confusion, let’s be clear that this is not correct.

It is true that many loss functions have zero as their minimum, but
gradient descent and the pytorch optimizers don’t care whether this
is true or not; they simply push the loss function to smaller – and
potentially negative – values.

log() is a so-called monotone function so log (ce_loss) achieves
its minimum exactly where ce_loss achieves its minimum. ce_loss
ranges from 0.0 to inf and, correspondingly, log (ce_loss) ranges
from -inf to inf. When your predictions are “perfect,” ce_loss is 0.0,
its minimum, and log (ce_loss) is -inf, its minimum.

Whether you train with ce_loss or log (ce_loss), the optimization
drives both towards their corresponding minima, namely 0.0 for ce_loss
and -inf for log (ce_loss). Nothing special happens when ce_loss
becomes 1/N (or 1 or any other value other than zero), and the fact that log (ce_loss) happens to be 0.0 when ce_loss = 1/N is irrelevant;
the optimization simply continues to drive log (ce_loss) to algebraically
smaller values, and thus to negative values on down towards -inf.

(Consider what would happen if one were to use log (ce_loss) + 17.2
or log (ce_loss) - 105.9 as the loss function for training. Would the
gradients change? Would the value 17.2 or 105.9 have any effect on the
result of training?)