# Using multiple losses for a model

I am training a multi-label autoencoder with two losses. Each loss targets a different segment of the model's attention module. Initially, I was thinking of combining the two into a single loss like this:

`loss = loss1 - coef * loss2` (with `coef` giving less weight to `loss2`)

But in this case the total loss goes negative and keeps growing more negative after some epochs. So I am thinking of backpropagating the two losses separately, like this:

```python
loss1.backward(retain_graph=True)  # keep the graph so loss2 can backprop through it too
loss2.backward()                   # gradients are added to those from loss1
optimizer.step()
```

I am looking for more suggestions, and would like to know if I am missing something in my first approach (the combined loss).

Clarifying question: Should both losses be minimized in your problem?

If so, then it seems to me that you would want something like `loss = loss1 + coef * loss2`. Your coefficient setup seems perfectly reasonable, though it will no doubt need to be tuned to produce the right focus for your model.

Mathematically speaking, I think calling `backward()` on each loss individually is equivalent to using `loss = loss1 + loss2`, since the second backward pass just accumulates its gradients on top of those from the first.
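A quick way to check the gradient-accumulation claim is to compare the gradients from both approaches on a toy model. This is only a sketch: the `Linear` layer, the random data, and the two loss definitions are placeholders I made up, not the autoencoder or the losses from the question.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the real model and data.
model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)
target = torch.randn(8, 2)

def two_losses():
    out = model(x)
    loss1 = torch.nn.functional.mse_loss(out, target)  # placeholder for loss1
    loss2 = out.abs().mean()                           # placeholder for loss2
    return loss1, loss2

# Approach A: call backward() on each loss separately.
loss1, loss2 = two_losses()
loss1.backward(retain_graph=True)
loss2.backward()  # accumulates into the .grad tensors filled by loss1
grads_separate = [p.grad.clone() for p in model.parameters()]

# Approach B: call backward() once on the summed loss.
model.zero_grad()
loss1, loss2 = two_losses()
(loss1 + loss2).backward()
grads_summed = [p.grad.clone() for p in model.parameters()]

same = all(
    torch.allclose(a, b, atol=1e-6)
    for a, b in zip(grads_separate, grads_summed)
)
print("gradients match:", same)
```

By the same reasoning, calling `(coef * loss2).backward()` in approach A should correspond to the weighted sum `loss1 + coef * loss2`.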

Thanks!
It was a surprise to me that the model's accuracy increased a lot even though the overall loss went extremely negative. I tested the model, and given how well it performs I am struggling to justify the loss going below zero, as I haven't seen any such examples so far.

@_joker in your original example you show subtracting one loss function from another. Most loss functions are defined to be minimized, not maximized. If you subtract a loss function, you are in effect asking the optimizer to make that loss as big as it can. This might explain your large negative loss values.
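To illustrate the point about subtraction, here is a toy sketch in plain Python with a hand-written gradient. The two quadratic losses and the value of `coef` are assumptions chosen so that the subtracted term dominates; they are not the losses from the thread. Whether a real combined loss is unbounded below depends on the actual loss functions and on `coef`.

```python
# Toy illustration with hypothetical losses (not the ones from the thread).
coef = 2.0  # assumed value, chosen so that the subtracted term dominates
lr = 0.1    # learning rate for plain gradient descent
w = 0.0     # single scalar parameter

def loss1(w):
    return (w - 1.0) ** 2

def loss2(w):
    return (w + 1.0) ** 2

def total_loss(w):
    return loss1(w) - coef * loss2(w)

def total_grad(w):
    # Derivative of (w - 1)^2 - coef * (w + 1)^2 with respect to w.
    return 2.0 * (w - 1.0) - coef * 2.0 * (w + 1.0)

history = []
for step in range(20):
    history.append((total_loss(w), loss2(w)))
    w -= lr * total_grad(w)  # minimizing the total pushes loss2 upward

first_total, first_loss2 = history[0]
last_total, last_loss2 = history[-1]
print(f"total loss: {first_total:.2f} -> {last_total:.2f}")
print(f"loss2:      {first_loss2:.2f} -> {last_loss2:.2f}")
```

In this toy setup the optimizer lowers the total by inflating `loss2`, so the total keeps falling below zero instead of settling at a minimum.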

@zacharynew
True! The reason to subtract the losses is:
`loss1`: encoded representation from the i-th class label
`loss2`: encoded representation from the non-i-th class labels

Here, I encourage the `loss1` representation to be mapped as close as possible to the hyper-ball center, where each class is a p-norm ball, so `loss2` has to be subtracted from `loss1`.