I am wondering how to implement the following scenario:
I run my neural network to compute a loss. Additionally, I compute a regularizer on every hidden layer’s output. Crucially, each regularizer should only affect the weights of the layer it is computed on. The regularizing gradient for a layer should not propagate backwards through the network and influence earlier layers’ parameters.
One option I can see is to compute each layer’s forward pass twice: once for forward propagation and the computation of the loss, and a second time, with the input to that layer detached, for the computation of the regularizer. That seems wasteful though.
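To make this first option concrete, here is a minimal sketch of what I mean. The toy network, the layer sizes, and the `regularizer` function are placeholders I made up for illustration; my actual regularizer is more involved.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy network; the sizes are arbitrary.
layers = nn.ModuleList([nn.Linear(4, 8), nn.Linear(8, 8), nn.Linear(8, 2)])

def regularizer(h):
    # Placeholder regularizer (mean squared activation).
    return h.pow(2).mean()

x = torch.randn(16, 4)
target = torch.randn(16, 2)

h = x
reg = 0.0
for i, layer in enumerate(layers):
    is_hidden = i < len(layers) - 1
    if is_hidden:
        # Second forward pass on a detached input: the regularizer's
        # gradient stops at this layer's own parameters.
        reg = reg + regularizer(torch.relu(layer(h.detach())))
    # Regular forward pass, used for the loss.
    h = layer(h)
    if is_hidden:
        h = torch.relu(h)

loss = nn.functional.mse_loss(h, target)
# A single backward pass: the loss reaches all layers, each regularizer
# term reaches only the layer it was computed on.
(loss + reg).backward()
```

This needs only one backward call, but every hidden layer is evaluated twice in the forward direction.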
The second option I have is to loop over several optimizers, each only responsible for a given layer’s parameters (or, equivalently, to zero the unwanted gradients after each regularizer’s individual backward() call).
This option is wasteful too, because we backpropagate through the network way too many times.
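A rough sketch of this second option, in the gradient-zeroing variant, again with a made-up toy network and a placeholder `regularizer`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy network; the sizes are arbitrary.
layers = nn.ModuleList([nn.Linear(4, 8), nn.Linear(8, 8), nn.Linear(8, 2)])

def regularizer(h):
    # Placeholder regularizer (mean squared activation).
    return h.pow(2).mean()

x = torch.randn(16, 4)
target = torch.randn(16, 2)

# Single forward pass, keeping every hidden activation.
h = x
hidden = []
for layer in layers[:-1]:
    h = torch.relu(layer(h))
    hidden.append(h)
loss = nn.functional.mse_loss(layers[-1](h), target)

# One backward pass per regularizer. Each pass runs all the way back to
# the first layer, and everything except layer i's gradient is discarded.
reg_grads = {}
for layer, h_i in zip(layers[:-1], hidden):
    for p in layers.parameters():
        p.grad = None
    regularizer(h_i).backward(retain_graph=True)
    for p in layer.parameters():
        reg_grads[p] = p.grad.clone()

# Final backward pass for the loss, then add the stored regularizer gradients.
for p in layers.parameters():
    p.grad = None
loss.backward()
for p, g in reg_grads.items():
    p.grad += g
```

With N hidden layers this is N + 1 backward passes, most of which compute gradients that are thrown away immediately.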
Finally, I am sure I could hijack the backward() call for each layer and modify the gradient manually. However, this is non-trivial for my given regularizer, and I would like to not give up the comfort of autograd.
In conclusion, I would like a way to tell the backward() call which parameters the gradients should be computed for.
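The closest thing I have found so far is `torch.autograd.grad`, which takes the tensors to differentiate with respect to as its second argument and returns the gradients instead of accumulating them into `.grad`. Below is a sketch of how I imagine using it, with the same made-up toy network and placeholder `regularizer`. I am not certain that this actually avoids the extra work of traversing the earlier layers, so please treat it as an illustration of the interface I am after rather than a solution.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy network; the sizes are arbitrary.
layers = nn.ModuleList([nn.Linear(4, 8), nn.Linear(8, 8), nn.Linear(8, 2)])

def regularizer(h):
    # Placeholder regularizer (mean squared activation).
    return h.pow(2).mean()

x = torch.randn(16, 4)
target = torch.randn(16, 2)

# Single forward pass, keeping every hidden activation.
h = x
hidden = []
for layer in layers[:-1]:
    h = torch.relu(layer(h))
    hidden.append(h)
loss = nn.functional.mse_loss(layers[-1](h), target)

# Ask only for the gradient of each regularizer with respect to the
# parameters of the layer it was computed on.
reg_grads = []
for layer, h_i in zip(layers[:-1], hidden):
    params = list(layer.parameters())
    grads = torch.autograd.grad(regularizer(h_i), params, retain_graph=True)
    reg_grads.append((params, grads))

# Usual backward pass for the loss, then add the per-layer regularizer gradients.
loss.backward()
for params, grads in reg_grads:
    for p, g in zip(params, grads):
        p.grad += g
```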
Thank you for your help