Using multiple losses for a model

_joker · April 12, 2020, 7:27pm

I am training a multi-label autoencoder with two losses. Each loss targets a different segment of the model's attention module. Initially, I was thinking of combining the two into a single loss, like this:

loss = loss1 - coef * loss2  # coef gives less weight to loss2

But in this case, the total loss goes negative after some epochs and keeps becoming more negative. So I am thinking of backpropagating the two losses separately instead, like this:

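loss1.backward(retain_graph=True)  # keep the graph so loss2 can backpropagate through it
loss2.backward()                   # gradients add to those already stored from loss1
optimizer.step()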

I am looking for more suggestions, and would like to know if I am missing something in my first approach (the combined loss).

zacharynew · April 13, 2020, 12:13am

Clarifying question: Should both losses be minimized in your problem?

If so, then it seems to me that you would want something like loss = loss1 + coef * loss2. Your coefficient setup seems perfectly reasonable, though the coefficient will no doubt need tuning to produce the right focus for your model.

Mathematically speaking, I think calling backward() on both individual losses is equivalent to using loss = loss1 + loss2, since the second backward pass just accumulates its gradients on top of those from the first.
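
The accumulation claim above can be checked on a toy case. This is only a sketch: the small linear model, the random data, the two loss functions, and the value of coef are all stand-ins, since the actual attention autoencoder is not shown in the thread.

```python
import torch

def make_model():
    # Tiny stand-in model; re-seeding gives both copies identical weights.
    torch.manual_seed(0)
    return torch.nn.Linear(4, 2)

torch.manual_seed(1)
x = torch.randn(8, 4)
target = torch.randn(8, 2)
coef = 0.5  # hypothetical weight on loss2

def two_losses(model):
    out = model(x)
    loss1 = torch.nn.functional.mse_loss(out, target)
    loss2 = out.abs().mean()
    return loss1, loss2

# Variant A: a single backward pass on the weighted sum.
model_a = make_model()
loss1, loss2 = two_losses(model_a)
(loss1 + coef * loss2).backward()

# Variant B: two backward passes; gradients accumulate in .grad.
model_b = make_model()
loss1, loss2 = two_losses(model_b)
loss1.backward(retain_graph=True)  # keep the shared graph for the second pass
(coef * loss2).backward()

# Compare the gradients of the two variants parameter by parameter.
same = all(
    torch.allclose(pa.grad, pb.grad, atol=1e-6)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```

If the two variants agree, the only practical difference is the extra backward pass. Note that the sign and the coefficient carry over too: calling backward() on plain loss2 corresponds to adding it with weight 1, not to subtracting it.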

_joker · April 13, 2020, 12:21am

Thanks!
It was a surprise to me that the model's accuracy increased a lot even though the overall loss became extremely negative. I tested the model, and given how well it performs I am unsure how to justify the loss going below zero, as I haven't seen any such examples so far.

zacharynew · April 13, 2020, 12:25am

@_joker in your original example you show subtracting one loss function from another. Most loss functions are defined to be minimized, not maximized. If you subtract a loss function, you are in effect asking the optimizer to make that loss as big as it can. This might be what produces your large negative loss values.
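
A toy illustration of that effect, in plain Python. The two scalar parameters a and b, the quadratic losses, and the values of coef and lr are made up for the example; the assumption is only that loss2 has no upper bound and depends on parameters that loss1 does not constrain.

```python
# Hypothetical setup: loss1 depends only on a, loss2 only on b.
coef, lr = 0.1, 0.1
a, b = 0.0, 1.0

history = []
for step in range(200):
    loss1 = (a - 1.0) ** 2   # minimized at a = 1
    loss2 = b ** 2           # has no upper bound
    history.append(loss1 - coef * loss2)

    # Gradients of the combined loss: loss1 - coef * loss2
    grad_a = 2.0 * (a - 1.0)
    grad_b = -coef * 2.0 * b

    # Plain gradient descent step
    a -= lr * grad_a
    b -= lr * grad_b

print(history[0], history[-1], a)
```

In this sketch a still converges to its optimum, so the part of the objective that loss1 measures keeps improving, while the combined value decreases without bound because b is pushed ever further from zero. That is one way a model can look good on its metric while the reported loss runs off to large negative values.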

_joker · April 13, 2020, 12:34am

@zacharynew
True! The reason to subtract the losses is:
loss1: encoded representation from the i-th class label
loss2: encoded representation from the non-i-th class labels

Here, I encourage the representation behind loss1 to be mapped as close as possible to the hyper-ball center, where each class is a p-norm ball, so loss2 has to be subtracted from loss1.
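
One possible reading of this setup, sketched in PyTorch. Everything concrete here is an assumption: the helper center_distances, the random encodings, the mask is_class_i, the zero center, and the values of coef and margin are hypothetical. The hinge term at the end is a swapped-in alternative, not something proposed in the thread; it pushes the other encodings away only until they clear a margin, which keeps the total bounded below by zero.

```python
import torch

def center_distances(z, center, p=2):
    # p-norm distance of each encoding (row of z) to the ball center.
    return torch.norm(z - center, p=p, dim=1)

torch.manual_seed(0)
z = torch.randn(6, 3, requires_grad=True)  # stand-in encodings
is_class_i = torch.tensor([True, True, True, False, False, False])
center = torch.zeros(3)                    # assumed center of the class-i ball
coef, margin = 0.1, 2.0                    # hypothetical values

dist = center_distances(z, center)
loss1 = dist[is_class_i].mean()    # class-i encodings: pull toward the center
loss2 = dist[~is_class_i].mean()   # other encodings: distance to be increased

# The subtraction described in the thread; unbounded below as loss2 grows.
subtracted = loss1 - coef * loss2

# Hinge alternative: penalize other encodings only while inside the margin.
hinge = torch.clamp(margin - dist[~is_class_i], min=0.0).mean()
bounded = loss1 + coef * hinge
bounded.backward()
```

With the hinge form, both terms are non-negative and both are minimized, so the combined value can be read like an ordinary loss.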