Hello everyone,
I am currently working on a deep learning research project and have a question about my loss function. I am using weighted cross entropy + soft Dice loss, but recently I came across a mean-IoU loss that works well. The catch is that it intentionally returns a negative loss. At first it seemed odd to me that it returns -loss, so I changed the function to return 1 - loss, but that performed worse, so I believe the negative loss is the correct approach. This means, though, that my final loss is the sum of positive, positive, and negative values, which seems very odd to me and doesn't really make sense, yet surprisingly it works reasonably well.
Hence, during training, my loss values go below 0 as training continues. My current guess is that this is fine because optimization drives the gradient of the loss toward zero, not the loss value itself.
My question is: is it okay to use a combination of positive and negative loss terms, given that what matters is just the gradient of my final loss function?
I used the approach from this thread for the IoU loss: How to implement soft-IoU loss?
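For reference, a minimal pure-Python sketch of the IoU term I am describing (the real version from the linked thread operates on torch tensors; the `eps` smoothing constant and the toy numbers here are just for illustration):

```python
def soft_iou_loss(pred, target, eps=1e-6):
    # pred: per-pixel probabilities in [0, 1]; target: binary ground truth.
    # Returns the *negative* soft IoU, so minimizing it pushes the IoU
    # toward 1 and the loss toward -1.
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return -(inter + eps) / (union + eps)

# toy example: good predictions already give a negative loss
loss = soft_iou_loss([0.9, 0.8, 0.1, 0.2], [1.0, 1.0, 0.0, 0.0])
# loss is about -0.739, before the (positive) cross-entropy and Dice terms are added
```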

Thank you, and I look forward to hearing an answer to this question soon!

Yes, it is perfectly fine to use a loss that can become negative.
Your reasoning about this is correct.

To add a few words of explanation:

A smaller loss (algebraically less positive or algebraically more
negative) means, or should mean, better predictions. The
optimization step uses some version of gradient descent to make
your loss smaller. The overall level of the loss doesn't matter as
far as the optimization goes. The gradient tells the optimizer how
to change the model parameters to reduce the loss, and it doesn't
care about the overall level of the loss.

When gradient descent drives the loss to a minimum, the gradient
becomes zero (although it can be zero at places other than a
minimum). (Also, when the gradient is zero, plain-vanilla gradient
descent stops changing the parameters.)
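To see this concretely, here is plain-vanilla gradient descent on a simple quadratic (a self-contained sketch, not tied to any particular framework); the parameter converges to the minimum, where the gradient vanishes:

```python
# f(w) = (w - 3)**2 has its minimum at w = 3, where f'(w) = 0
w = 0.0
lr = 0.1
for _ in range(200):
    grad = 2.0 * (w - 3.0)  # analytic gradient of f
    w -= lr * grad          # plain-vanilla gradient-descent step
# w has converged to (very nearly) 3, and the gradient there is
# (very nearly) 0, so further steps stop changing w
```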

It is true that several common loss functions are non-negative, and
become zero precisely when the predictions are "perfect." Examples
include MSELoss and CrossEntropyLoss. But this is by no means
a requirement.

Consider, for example, optimizing with lossA = MSELoss. Now
imagine optimizing with lossB = lossA - 17.2. The 17.2 doesn't
really change anything at all. It is true that "perfect" predictions
will yield lossB = -17.2 rather than zero. (lossA will, of course,
be zero for "perfect" predictions.) But who cares?
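A quick numerical check of this point: the constant shifts the loss values but leaves the gradient untouched everywhere. (A small self-contained sketch; `loss_a` here is a stand-in for MSELoss on one scalar prediction with target 3.0.)

```python
def loss_a(w):
    # MSE-style loss for a single scalar prediction; non-negative,
    # zero at the "perfect" prediction w = 3.0
    return (w - 3.0) ** 2

def loss_b(w):
    # the same loss shifted by a constant; its "perfect" value is -17.2
    return loss_a(w) - 17.2

def num_grad(f, w, h=1e-6):
    # central finite-difference estimate of f'(w)
    return (f(w + h) - f(w - h)) / (2.0 * h)

# the constant offset cancels in the difference, so the two losses
# have identical gradients at every point
for w in (-2.0, 0.0, 1.5, 4.0):
    assert abs(num_grad(loss_a, w) - num_grad(loss_b, w)) < 1e-6
```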