I currently have the following network with two losses. I use a single optimizer and found that multiplying one of the losses by a scalar leads to better results:

L_total = lambda L1 + L2

My hypothesis is that the lambda hyperparameter should only affect the shared representation. Therefore, I would like to “apply” it only after backpropagation through the two separate networks. My questions are:

1. How can I apply lambda after backpropagation through each of the single networks, so that only the shared representation is affected by (lambda L1 + L2)? Is multiplying the gradients that come from L1 by lambda enough?
2. Does that make any sense?
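
One way to sketch the idea in question 1 is an identity layer that scales the gradient on its way back, placed between the shared features and the first head. With it, head 1 receives the unscaled gradient of L1, while the shared layers receive lambda * dL1 + dL2. This is only a minimal illustration: `ScaleGrad`, `shared`, `head1`, `head2`, the layer sizes, the MSE losses, and `lam = 0.1` are all assumptions, not taken from the network in the question.

```python
import torch
import torch.nn as nn


class ScaleGrad(torch.autograd.Function):
    """Hypothetical helper: identity in forward, scales the gradient in backward."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The gradient flowing back into the shared part is multiplied by `scale`.
        # `scale` itself is a plain float, so it gets no gradient (None).
        return grad_output * ctx.scale, None


torch.manual_seed(0)
shared = nn.Linear(4, 8)   # stand-in for the shared representation
head1 = nn.Linear(8, 1)    # stand-in for the network producing L1
head2 = nn.Linear(8, 1)    # stand-in for the network producing L2
lam = 0.1                  # assumed value of lambda

x = torch.randn(16, 4)
y1 = torch.randn(16, 1)
y2 = torch.randn(16, 1)

features = torch.relu(shared(x))
# Only the path from L1 back into the shared features is scaled by lambda.
loss1 = nn.functional.mse_loss(head1(ScaleGrad.apply(features, lam)), y1)
loss2 = nn.functional.mse_loss(head2(features), y2)

# Plain sum: lambda is already applied inside the backward pass of ScaleGrad.
(loss1 + loss2).backward()
```

After `backward()`, `head1.weight.grad` is the gradient of L1 alone, and `shared.weight.grad` is lambda times the L1 gradient plus the L2 gradient, so a single optimizer over all parameters can be stepped as usual.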

Thank you.

@deepfailure Multiplying by a scalar will not affect the overall loss. The only way it could have an effect is if you backpropagate on the total loss and the lambda scalar is a negative number.

Perhaps you misunderstood me.

The scalar multiplies only one of the two losses. I think this is standard practice in a lot of papers, so it should work.

For instance, in the UNREAL paper various losses are used and each loss is multiplied by a different scalar: https://arxiv.org/pdf/1611.05397.pdf

I have no idea how to apply `lambda L1 + L2` only to the shared representation.
I think the paper you referred to (https://arxiv.org/pdf/1611.05397.pdf) does not follow the method you describe.

The authors computed the loss by multiplying L1 by lambda, that’s all.
Consequently, L_total = lambda L1 + L2 is enough.
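
For reference, the plain weighted sum with a single optimizer might look like the sketch below. The module names, layer sizes, MSE losses, SGD settings, and `lam = 0.1` are assumptions made for illustration. Note that in this form lambda also scales the gradient of the L1 head, not only the shared part; the comparison against the unscaled gradient makes that visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
shared = nn.Linear(4, 8)   # stand-in for the shared representation
head1 = nn.Linear(8, 1)    # stand-in for the network producing L1
head2 = nn.Linear(8, 1)    # stand-in for the network producing L2
params = (
    list(shared.parameters())
    + list(head1.parameters())
    + list(head2.parameters())
)
optimizer = torch.optim.SGD(params, lr=0.01)
lam = 0.1                  # assumed value of lambda

x = torch.randn(16, 4)
y1 = torch.randn(16, 1)
y2 = torch.randn(16, 1)

optimizer.zero_grad()
features = torch.relu(shared(x))
loss1 = nn.functional.mse_loss(head1(features), y1)
loss2 = nn.functional.mse_loss(head2(features), y2)
total = lam * loss1 + loss2

# Unscaled gradient of L1 w.r.t. head 1, computed only for comparison.
(g1_head,) = torch.autograd.grad(loss1, head1.weight, retain_graph=True)

total.backward()

# Checks whether head 1's gradient equals lam * dL1/dW (up to float tolerance).
print(torch.allclose(head1.weight.grad, lam * g1_head, atol=1e-6))

optimizer.step()
```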