Maybe, in my case, I should not be setting requires_grad=False on the L2 parameters; instead, I should exclude all L2 parameters from the optimizer. That way, the right amount of gradient still flows back to L1's parameters, but the optimizer never updates L2's parameters (which is analogous to freezing L2 while keeping L1 trainable). Something like the sketch below.
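Here is a minimal sketch of what I mean, assuming a hypothetical model where L1 and L2 are just two nn.Linear sub-modules (the actual module types and sizes don't matter):

```python
import torch
import torch.nn as nn

# Hypothetical model with two sub-modules, L1 and L2.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.L1 = nn.Linear(10, 20)
        self.L2 = nn.Linear(20, 1)

    def forward(self, x):
        return self.L2(torch.relu(self.L1(x)))

model = Net()

# Leave requires_grad=True on everything so gradients flow back normally,
# but register only L1's parameters with the optimizer.
optimizer = torch.optim.SGD(model.L1.parameters(), lr=1e-2)

x = torch.randn(4, 10)
loss = model(x).mean()
loss.backward()        # L2's .grad fields get populated, but...
optimizer.step()       # ...only L1's parameters are updated
optimizer.zero_grad()
```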
Is this a correct approach?