I currently have the following network with two losses.
I have a single optimizer, and I found that multiplying one of the losses by a scalar leads to better results, such that:
L_total = lambda L1 + L2
My hypothesis is that the lambda hyperparameter should only affect the shared representation. Therefore, I would like to “apply” it after backpropagating through each of the separate networks. My questions are:
- How do I apply lambda to the loss after each single network’s backpropagation, so that only the shared representation is affected by (lambda L1 + L2)? Is multiplying the gradients that come from L1 by lambda enough?
- Does that make any sense?
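To make the comparison concrete, here is a minimal sketch in plain Python (a hypothetical toy network with scalar weights: a shared weight `w_s` feeding two linear heads `w1` and `w2`, squared-error losses; all numbers are made up). It backprops each loss separately and scales only the gradient flowing from L1 into the shared weight by lambda, then compares that with backpropagating `lambda L1 + L2` in one go:

```python
import math

# Hypothetical toy setup (all scalars): shared weight w_s feeds two
# linear heads w1 and w2 with squared-error losses L1 and L2.
x, t1, t2 = 2.0, 1.0, -1.0
w_s, w1, w2, lam = 0.5, 0.3, -0.7, 0.1

h = w_s * x          # shared representation
e1 = w1 * h - t1     # head-1 error, L1 = e1 ** 2
e2 = w2 * h - t2     # head-2 error, L2 = e2 ** 2

# Per-loss gradients (manual backprop through the toy network).
dL1_dw1 = 2 * e1 * h
dL1_dws = 2 * e1 * w1 * x
dL2_dw2 = 2 * e2 * h
dL2_dws = 2 * e2 * w2 * x

# (a) Standard: backprop L_total = lam * L1 + L2 in one pass.
std_w1 = lam * dL1_dw1
std_ws = lam * dL1_dws + dL2_dws

# (b) Proposed: backprop each loss separately, then scale only the
#     gradient that flows from L1 into the shared weight by lam.
prop_w1 = dL1_dw1                   # head-1 weight left unscaled
prop_ws = lam * dL1_dws + dL2_dws   # shared weight scaled

# The shared-weight gradients agree; the two schemes differ only in
# the head-1 gradient, which is off by exactly the factor lam.
print(math.isclose(std_ws, prop_ws))         # True
print(math.isclose(std_w1, lam * prop_w1))   # True
```

So in this toy case, yes: scaling the gradients that come from L1 by lambda at the shared representation reproduces the shared-representation gradient of `lambda L1 + L2` exactly; the only place the two schemes diverge is in the updates of head 1’s own weights.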
@deepfailure Multiplying by a scalar will not affect the overall loss. The only way it could have an effect is if you backpropagate on the total loss and the lambda scalar is a negative number.
Perhaps you misunderstood me.
The scalar multiplies only one of the two losses. I think this is standard practice in a lot of papers, so it should work.
For instance, in the UNREAL paper various losses are used and each loss is multiplied by a different scalar: https://arxiv.org/pdf/1611.05397.pdf
I don’t have any idea how to apply lambda L1 + L2 only to the shared representation.
I think the paper you referred to (https://arxiv.org/pdf/1611.05397.pdf) does not follow the method you mentioned.
The authors simply compute the loss by multiplying L1 by lambda, that’s all.
Consequently, L_total = lambda L1 + L2 is enough.