How to normalize losses of different scales

I have two losses: one is a standard MSELoss, and the other is a custom loss function I wrote for regularization.

The problem is that these losses are not necessarily on the same numerical scale, so I have to figure out how to weight them every time (dividing or multiplying one by a constant so they end up on the same scale). I would much prefer to set the relative weight of each loss once and not worry about it again, so that loss1 constitutes 99% of my loss value and loss2 constitutes 1% (some sort of normalized weighting).

I had previously added the two different loss functions together like this:

batch_loss = reconstruction_loss + monotonic_loss

But instead I want to normalize the losses so I can choose how much each contributes to parameter updates. This is what I was thinking:

import torch

def CombineLosses(losses, weights):
    # Dividing by loss.item() (a detached Python float) normalizes each
    # loss to 1 without affecting autograd, so after multiplying by its
    # weight, each term contributes exactly its weight to the total value.
    combined_loss = 0.0  # start from a plain float to avoid device/dtype clashes
    for loss, wt in zip(losses, weights):
        combined_loss = combined_loss + (loss * wt) / loss.item()
    return combined_loss

batch_loss = CombineLosses([reconstruction_loss, monotonic_loss], [0.99, 0.01])
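For instance, a quick sanity check with stand-in values at very different scales (these are hypothetical leaf tensors, not the outputs of a real model):

reconstruction_loss = torch.tensor(0.16, requires_grad=True)
monotonic_loss = torch.tensor(1e8, requires_grad=True)

batch_loss = CombineLosses([reconstruction_loss, monotonic_loss], [0.99, 0.01])
print(batch_loss)  # ~tensor(1.) -- each term contributes exactly its weight: 0.99 + 0.01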

The idea is that I can then weight them, and if my regularization function sums to 1e8 and the MSELoss to 0.16, I can still incorporate both. However, I might have a gap in my understanding of how autograd works, so I have a couple of questions:

  1. Will scaling my two loss functions actually limit each loss’s contribution proportionally to its weight? If not, what is the preferred way to combine losses of different scales in PyTorch?
  2. As an alternative, would it be possible to alternate between training on one loss function and training on the other, i.e., switching objectives every epoch?

Thanks in advance


Yes, scaling makes a difference. For example, consider the trivial scenario of L = (x-y)^2 vs. L = 5(x-y)^2: dL/dx = 2(x-y) in the first case and dL/dx = 10(x-y) in the second, so the weight multiplies the gradient by the same factor.
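A minimal check of this with autograd (the values of x and y are arbitrary, picked only for illustration):

import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(1.0)

# L = (x - y)^2  ->  dL/dx = 2(x - y) = 4
loss = (x - y) ** 2
loss.backward()
print(x.grad)  # tensor(4.)

# L = 5(x - y)^2  ->  dL/dx = 10(x - y) = 20
x.grad = None
loss = 5 * (x - y) ** 2
loss.backward()
print(x.grad)  # tensor(20.)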

There is nothing wrong with the second approach either, and there are many variations, such as freezing part of a model trained with one objective and fine-tuning a different part with another objective. Of course, experimentation is probably the way to confirm what does and doesn’t work well.
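A rough sketch of that alternating scheme, with a toy model and a stand-in regularizer (every name here is a placeholder, not your actual setup):

import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
mse_loss = nn.MSELoss()

def monotonic_loss_fn(outputs):
    # Stand-in regularizer: penalize decreases between consecutive outputs.
    return torch.relu(outputs[:-1] - outputs[1:]).sum()

inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

for epoch in range(10):
    optimizer.zero_grad()
    outputs = model(inputs)
    # Alternate objectives: reconstruction on even epochs, regularizer on odd.
    if epoch % 2 == 0:
        loss = mse_loss(outputs, targets)
    else:
        loss = monotonic_loss_fn(outputs)
    loss.backward()
    optimizer.step()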

Weighting losses can seem like a crude tool, but last I checked, even huge models with countless objectives and sub-tasks (e.g., Tesla Autopilot) balance their objectives this way, often with hand-tuned weights.


Thank you, I appreciate the help. I was somewhat worried that training would become difficult if one objective had a gradient larger by a factor of 10 or 100 million; glad there is a way to weight them and that it’s not completely off-the-wall to do so!