I misread PyTorch’s `NLLLoss()`

and accidentally passed my model’s probabilities to the loss function instead of my model’s log probabilities, which is what the function expects. However, when I train a model under this misused loss function, the model (a) learns faster, (b) learns more stably, (c) reaches a lower loss, and (d) performs better at the classification task.
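For concreteness, here is a minimal sketch (with made-up logits and targets) of the misuse being described. `nn.NLLLoss` simply picks out `-input[i, target[i]]` and averages, so feeding it probabilities instead of log probabilities silently optimizes `-p` instead of `-log p`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)               # hypothetical batch: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 0])

log_probs = torch.log_softmax(logits, dim=1)  # what nn.NLLLoss expects
probs = torch.softmax(logits, dim=1)          # the misused input

criterion = nn.NLLLoss()
correct_loss = criterion(log_probs, targets)  # mean of -log p(target): proper NLL
wrong_loss = criterion(probs, targets)        # mean of -p(target): the accidental loss
```

Note that the misused loss is bounded in [-1, 0], while the proper NLL is unbounded above, so the raw loss values aren’t directly comparable between the two setups.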

I don’t have a minimal working example, but I’m curious whether anyone else has experienced this or knows why it happens. Any possible hypotheses?

One hypothesis I have is that the gradient of the misused loss function is more stable because the derivative isn’t scaled by 1/(model output probability).
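That hypothesis is easy to check with autograd. A small sketch (probability values chosen arbitrarily): the correct loss `-log(p)` has derivative `-1/p` with respect to `p`, which blows up as `p → 0`, while the misused loss `-p` has a constant derivative of `-1`:

```python
import torch

# Correct loss -log(p): gradient w.r.t. p is -1/p, huge for small p
p = torch.tensor([0.9, 0.1, 0.001], requires_grad=True)
grad_log = torch.autograd.grad((-torch.log(p)).sum(), p)[0]   # = -1/p per element

# Misused loss -p: gradient w.r.t. p is -1 everywhere
q = torch.tensor([0.9, 0.1, 0.001], requires_grad=True)
grad_lin = torch.autograd.grad((-q).sum(), q)[0]              # = -1 per element
```

So for the sample with `p = 0.001`, the correct loss produces a gradient a thousand times larger, which is consistent with the stability you’re seeing.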

Without the `log` term you wouldn’t calculate the entropy, would you?

From the Wikipedia article on Cross entropy, you would calculate `p(x) * q(x)` instead, I think.

I haven’t seen this outcome yet, but are you sure your model learns better in general, or could the current hyperparameters just be “bad” for the proper usage of `nn.NLLLoss` with log probabilities?

Interesting. Here’s my “hypothesis”: you can optimize the individual probability instead of log(prob), no problem. But training has a standard interpretation as maximization of the joint density over the dataset, which is the product of the per-sample densities, or their sum in the log domain.

You can also view this as optimizing the geometric mean of the probabilities, while your procedure maximizes their arithmetic mean. From this, you can deduce the consequences. My guess would be overfitting to “easy” samples.

@ptrblck that’s correct. Without the log, the loss is no longer cross entropy and your `p(x) * q(x)` is computed instead.

The wrong loss function appears to work better across different learning rates, with and without momentum, but I haven’t compared the two loss functions extensively (e.g. I could try switching optimizers).

@googlebot Can you walk me through “you can deduce the consequences?” I’m not sure I understand how cross entropy relates to geometric and arithmetic means.

The task isn’t a typical supervised classification task, so it’s difficult to define what an easy sample is. However, the task admits an exact Bayesian baseline, and the wrong loss function does better than the correct loss function on data that require using inferred latent variables. We think of these data as the “harder” cases.

See the “Relation to log-likelihood” section of the Wikipedia page linked above.

Then consider a dataset with two samples. You optimize their likelihoods independently, and the total likelihood is their product, P_total = P1 * P2, or log(P_total) = log(P1) + log(P2). So the geometric mean of P1 and P2, i.e. the log-average of the likelihoods, is what the usual per-sample loss implicitly maximizes. In other words, increasing the smaller likelihood has a bigger effect, due to the log transform.
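A tiny numeric example of that last point (arbitrary numbers): spending the same probability increment on the smaller likelihood pays off more in the log domain, while the plain sum is indifferent.

```python
import math

p1, p2 = 0.9, 0.1   # hypothetical per-sample likelihoods
delta = 0.05

# Gain in log(P1) + log(P2) from adding delta to one of the likelihoods:
gain_boost_small = math.log(p2 + delta) - math.log(p2)   # boost the hard sample
gain_boost_large = math.log(p1 + delta) - math.log(p1)   # boost the easy sample

# Under the plain sum P1 + P2 (arithmetic mean), either move is worth
# exactly delta, so the optimizer has no reason to prefer the hard sample.
```

Here `gain_boost_small` is log(0.15/0.1) ≈ 0.41 while `gain_boost_large` is log(0.95/0.9) ≈ 0.05, so the log objective rewards improving the hard sample roughly 7–8x more.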

Now if you optimize P1 + P2 instead, it is possible to reach the same parameters, but the optimization path will be different, and less robust I think.

Ok makes sense. Thanks for explaining!

@googlebot why do you say the optimization path will be less robust? I would intuitively think taking the gradient w.r.t. `-sum p(x_i | theta)` would be more stable than `-sum log p(x_i | theta)`, no?

Yes, but the challenge is to learn the function that produces the amortized thetas, theta_i = neural_net(input_i), in a way that also generalizes well. log() acts like a gradient booster for small likelihoods, so samples with smaller “true probability” are not treated as less important.
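One way to see this “gradient booster” effect for softmax outputs: the gradient of the misused loss `-p(target)` w.r.t. the logits equals the usual cross-entropy gradient scaled per-sample by `p(target)` itself, so low-likelihood (“hard”) samples get their gradient suppressed rather than boosted. A sketch with arbitrary logits:

```python
import torch

# Two samples with the same target class: one the model is already
# confident about ("easy"), one it is not ("hard").
logits = torch.tensor([[4.0, 0.0, 0.0],    # easy: p(target) ~ 0.96
                       [0.2, 0.0, 0.0]],   # hard: p(target) ~ 0.38
                      requires_grad=True)
targets = torch.tensor([0, 0])

idx = torch.arange(2)
p_t = torch.softmax(logits, dim=1)[idx, targets]
grad_correct = torch.autograd.grad((-torch.log(p_t)).sum(), logits)[0]

logits2 = logits.detach().clone().requires_grad_(True)
p_t2 = torch.softmax(logits2, dim=1)[idx, targets]
grad_wrong = torch.autograd.grad((-p_t2).sum(), logits2)[0]

# grad_wrong == grad_correct scaled per-sample by p(target):
# the hard sample's update shrinks instead of being boosted.
```

So with the wrong loss, exactly the samples the model currently gets wrong contribute the least to the parameter update, which fits the overfit-to-easy-samples story above.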

Networks with big enough capacity can learn optimal thetas without this “importance balancing”, but then they’re prone to overfitting.