Loss when optimizing on a single batch

When debugging a neural network model, should we expect to reach a loss of 0 on a single batch, e.g. after running 1000 optimization steps on the same batch of inputs?
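To make the setup concrete, here is a minimal sketch of what I mean (the model, batch size, learning rate, and criterion are arbitrary placeholders, not my actual setup):

```python
import torch
import torch.nn as nn

# Toy example: a small classifier, one fixed batch, and 1000 optimization
# steps on that same batch to see whether the loss approaches zero.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(16, 20)          # one fixed batch of inputs
y = torch.randint(0, 10, (16,))  # one fixed batch of targets

for step in range(1000):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, loss.item())
```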

If the loss is close to 0, we can conclude the model is able to overfit that batch.
If the loss stays far from 0, e.g. it starts at 5.5 and only drops to 4.7, what does that tell us?

For example:

  • the model simply needs more optimization steps on the single batch
  • the model is not able to overfit a single batch, so its design is not working
  • the model has far too many parameters to be tuned with a single batch of data
  • any other possible explanations are welcome!

Thanks!!!

The actual loss values depend on the criterion used, but overfitting a single batch should drive the loss towards zero for a “standard” criterion.
If your model isn’t able to overfit a single sample, the overall training setup is questionable, i.e. the model architecture, hyperparameters, etc. could be “bad”. It’s hard to give a general statement as it depends on the use case.

Thanks for replying to this “vague” question. Yes, it feels like there could be many possible causes depending on the use case.

If the model contains dropout and layer norm, do we still expect to see a zero loss for a single example during training? In that case, we should call model.eval() before expecting a near-zero loss, right?
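For example, something like this (a made-up toy model with dropout and layer norm, just to show what I mean about train vs. eval mode):

```python
import torch
import torch.nn as nn

# Hypothetical model with layer norm and dropout; compare the loss on the
# same fixed batch in train mode (dropout active) vs. eval mode (dropout off).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.LayerNorm(64), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(64, 10))
criterion = nn.CrossEntropyLoss()
x = torch.randn(16, 20)
y = torch.randint(0, 10, (16,))

model.train()
with torch.no_grad():
    print("train mode:", criterion(model(x), y).item())  # dropout active -> noisy
    print("train mode:", criterion(model(x), y).item())  # differs between calls

model.eval()
with torch.no_grad():
    print("eval mode: ", criterion(model(x), y).item())  # deterministic
    print("eval mode: ", criterion(model(x), y).item())  # same value both times
```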

What counts as a “standard” criterion and what doesn’t? Thanks.

I would not expect a perfect 0 as the final loss in any case, if only due to the limited floating point precision, but a small value instead.
By “standard” criterion I meant the loss functions included in torch.nn, assuming you are not inverting them, adding constants, etc.
Depending on your use case you might, e.g., define a “zero” loss that converges towards -100 if that fits your use case (mathematically it wouldn’t make a difference if you just subtract a constant from the loss).
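As a quick sanity check of the last point (toy model and numbers are made up): subtracting a constant only shifts the loss value, not the gradients.

```python
import torch
import torch.nn as nn

# Verify that a constant offset on the loss leaves the gradients unchanged,
# so a loss converging towards e.g. -100 is as "zero" as one converging to 0.
torch.manual_seed(0)
x = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))

model = nn.Linear(4, 3)
criterion = nn.CrossEntropyLoss()
params = list(model.parameters())

loss = criterion(model(x), y)
grads_plain = torch.autograd.grad(loss, params, retain_graph=True)

shifted = loss - 100.0  # constant offset moves the value, not the slope
grads_shifted = torch.autograd.grad(shifted, params)

for g1, g2 in zip(grads_plain, grads_shifted):
    print(torch.allclose(g1, g2))  # True for every parameter
```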