I currently have the following network with two losses:

I have a single optimizer and found that multiplying one of the losses by a scalar leads to better results, such that:

L_total = lambda L1 + L2

My hypothesis is that the lambda hyperparameter should only affect the shared representation. Therefore, I would like to “apply” it after backpropagation through both separated networks. My question is:

How can I apply lambda after each individual network’s backpropagation so that only the shared representation is affected by (lambda L1 + L2)? Is multiplying the gradients that come from L1 by lambda enough?
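For what it’s worth, one way to multiply only the gradient that L1 sends back into the shared representation is a custom autograd function that is the identity in the forward pass and scales the gradient in the backward pass. This is just a sketch: the layer sizes and names (`shared`, `head1`, `head2`, `lam`) are made up for illustration, not taken from your network.

```python
import torch

class ScaleGrad(torch.autograd.Function):
    """Identity on the forward pass; multiplies the incoming gradient
    by a constant on the backward pass."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Scale the gradient w.r.t. x; no gradient for the scale itself.
        return ctx.scale * grad_output, None

# Hypothetical toy setup (illustrative names and sizes only).
torch.manual_seed(0)
shared = torch.nn.Linear(4, 4)   # shared representation
head1 = torch.nn.Linear(4, 1)    # separated network producing L1
head2 = torch.nn.Linear(4, 1)    # separated network producing L2
lam = 0.5

x = torch.randn(8, 4)
h = shared(x)
# Only the gradient flowing from L1 back into the shared layer is
# scaled by lam; head1's own parameter gradients stay unscaled.
l1 = head1(ScaleGrad.apply(h, lam)).pow(2).mean()
l2 = head2(h).pow(2).mean()
(l1 + l2).backward()
```

With this, the shared layer sees exactly the gradient of lambda L1 + L2, while head1’s parameters are updated with the unscaled gradient of L1.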

@deepfailure Multiplying by a scalar will not affect the overall loss. The only way it could have an effect is if you backpropagate on the total loss and the lambda scalar is a negative number.

I don’t have any idea how to apply lambda L1 + L2 only to the shared representation.
I think the paper you referred to (https://arxiv.org/pdf/1611.05397.pdf) does not follow the method you mentioned.

The authors simply calculated the loss by multiplying L1 by lambda, that’s all.
Consequently, L_total = lambda L1 + L2 is enough.
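In code, that straightforward version is just a weighted sum followed by a single backward pass. Again a minimal sketch with placeholder losses (`l1`, `l2` are stand-ins, not your actual losses):

```python
import torch

# Hypothetical toy model; the two output columns stand in for the
# quantities that produce L1 and L2.
torch.manual_seed(0)
model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)
out = model(x)
l1 = out[:, 0].pow(2).mean()   # stands in for L1
l2 = out[:, 1].pow(2).mean()   # stands in for L2
lam = 0.5

total = lam * l1 + l2          # L_total = lambda L1 + L2
total.backward()               # one backward pass on the combined loss
```

Here every parameter (shared or not) receives lam times the gradient from L1, which is the difference from the gradient-scaling variant discussed above.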