Hi Edward!
Yes, it is perfectly fine to use a loss that can become negative.
Your reasoning about this is correct.
To add a few words of explanation:
A smaller loss – algebraically less positive or algebraically more
negative – means (or should mean) better predictions. The
optimization step uses some version of gradient descent to make
your loss smaller. The overall level of the loss doesn’t matter as
far as the optimization goes. The gradient tells the optimizer how
to change the model parameters to reduce the loss, and it doesn’t
care about the overall level of the loss.
When gradient descent drives the loss to a minimum, the gradient
becomes zero (although it can be zero at places other than a
minimum). (Also, when the gradient is zero, plain-vanilla gradient
descent stops changing the parameters.)
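To make that last point concrete, here is a minimal sketch in plain Python (no framework assumed, with a made-up one-parameter toy function) of plain-vanilla gradient descent on f(x) = x**2, whose gradient is 2*x. Once x reaches the minimum, the gradient is zero and the update stops changing the parameter:

```python
def gradient_descent(x, lr=0.1, steps=100):
    # Plain-vanilla gradient descent on f(x) = x**2.
    for _ in range(steps):
        grad = 2 * x       # analytic gradient of x**2
        x = x - lr * grad  # when grad == 0, x stops changing
    return x

x_final = gradient_descent(5.0)
print(abs(x_final) < 1e-6)  # converged to the minimum at x = 0
print(gradient_descent(0.0))  # already at the minimum: stays at 0.0
```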
It is true that several common loss functions are non-negative, and
become zero precisely when the predictions are “perfect.” Examples
include MSELoss and CrossEntropyLoss. But this is by no means a
requirement.
Consider, for example, optimizing with lossA = MSELoss. Now imagine
optimizing with lossB = lossA - 17.2. The 17.2 doesn’t really change
anything at all. It is true that “perfect” predictions will yield
lossB = -17.2 rather than zero. (lossA will, of course, be zero for
“perfect” predictions.) But who cares? 
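You can check this numerically. The sketch below (plain Python, with a made-up one-parameter squared-error toy standing in for MSELoss) compares finite-difference gradients of lossA and lossB and shows the constant offset cancels out:

```python
def lossA(w):
    # toy squared-error loss in one parameter; minimum ("perfect") at w = 3.0
    return (w - 3.0) ** 2

def lossB(w):
    # same loss shifted down by a constant
    return lossA(w) - 17.2

def numeric_grad(f, w, eps=1e-6):
    # central finite-difference approximation of df/dw
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 1.5
gA = numeric_grad(lossA, w)
gB = numeric_grad(lossB, w)
print(abs(gA - gB) < 1e-6)  # identical gradients: the -17.2 cancels
print(lossB(3.0))           # "perfect" prediction gives -17.2, not 0
```

The optimizer only ever sees the gradient, so both losses drive the parameters to exactly the same place.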
Best.
K. Frank