Combining optimally weighted multiple losses

Hi guys,

I’m currently designing a network that combines multiple losses.

My output consists of 4 channels, where the first channel’s loss should be calculated using the BCEWithLogitsLoss and the other three channels should be calculated using MAE (L1Loss).

Currently, my implementation looks as follows:

bceLoss = self.BCE(x[:, 0], y[:, 0])   # call the module, not .forward(); use the whole batch, not only sample 0
mae = self.MAE(x[:, 1:], y[:, 1:])     # channels 1-3: L1 / MAE
return bceLoss + mae

The issue is that the MAE loss seems to outweigh the BCE loss by a lot, so the network focuses mostly on reducing the MAE loss.
I know I can do something along the lines of: return bceLoss * alpha + mae * beta
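Wrapped up as a small module, that weighted combination could look like the sketch below (the class name CombinedLoss and the alpha/beta defaults are my own assumptions, not code from the thread):

```python
import torch
import torch.nn as nn

class CombinedLoss(nn.Module):
    """Weighted sum of BCEWithLogitsLoss (channel 0) and L1/MAE (channels 1:)."""
    def __init__(self, alpha=1.0, beta=1.0):
        super().__init__()
        self.alpha = alpha
        self.beta = beta
        self.bce = nn.BCEWithLogitsLoss()
        self.mae = nn.L1Loss()

    def forward(self, x, y):
        # x, y: (batch, 4, ...) tensors; channel 0 is the BCE target,
        # channels 1-3 are the regression targets.
        bce_loss = self.bce(x[:, 0], y[:, 0])
        mae_loss = self.mae(x[:, 1:], y[:, 1:])
        return self.alpha * bce_loss + self.beta * mae_loss
```

Note that the BCE targets in y[:, 0] should lie in [0, 1], since BCEWithLogitsLoss applies the sigmoid internally only to the predictions.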

The question I now have is: how do I know what alpha and beta should be, to weigh them properly?

Kind regards,

Hi Emil,

so one observation is that doubling both alpha and beta just scales your loss, so you can fix one of them, say alpha = 1, and only tune the other.
Of course, the way to be sure is to use an additional hold-out set (or cross-validation or so), try a few values of beta, and take the best-performing one; that is “the gold standard”. If you really need to do it well and are prepared to expend effort on it, this is the thing to do.
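As a sketch of that hold-out procedure: sweep a few candidate betas and keep the one with the best validation score. Here validation_score is a toy stand-in (my assumption, so the snippet runs end to end); in practice it would train with that beta and evaluate on the hold-out set.

```python
def validation_score(beta):
    # Stand-in for "train with this beta, evaluate on the hold-out set";
    # a toy quadratic here so the sketch is self-contained and runnable.
    return (beta - 0.5) ** 2

# Log-spaced candidates are a common choice when the right scale is unknown.
candidate_betas = [0.01, 0.1, 0.5, 1.0, 10.0]
best_beta = min(candidate_betas, key=validation_score)
```

With the toy score above, best_beta comes out as 0.5; with a real training loop you would compare actual validation metrics instead.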

As that can be a bit tedious, though, my intuitive approach would be to scale the two loss components to be of similar size (they don’t need to be exactly equal - that wouldn’t last anyway - but say within a factor of 2 or 3). If you don’t do this, gradient descent will essentially optimize the larger loss, and the smaller one only has an influence once the larger one stops changing much. (This is a bit hand-wavy and ignores gradient magnitudes and all, but maybe it can be useful.)

Best regards