Grad through frozen weights

I am a little confused, please help enlighten me.

Imagine an image model trained on cat-and-dog data. I have seen models where a trainable autoencoder reconstructs the image, the output is then passed through a frozen ResNet encoder, and a trainable linear layer on top classifies cat vs. dog.

That is, the input travels through trainable layers, then frozen layers, then trainable layers again. How can that possibly train? What are the general rules for training through frozen weights? I have tested training an autoencoder through a frozen decoder… that seems to work. But adding a trainable layer afterward seems to me like it should break the chain. A minimal sketch of what I mean (a toy autoencoder plus torchvision's resnet18 as the frozen encoder) is below.
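
```python
import torch
import torch.nn as nn
from torchvision import models

# Toy autoencoder that reconstructs the input image (stand-in for the real one).
autoencoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)

# Frozen ResNet backbone used as a fixed feature extractor.
resnet = models.resnet18(weights=None)
resnet.fc = nn.Identity()            # drop the original classifier head
for p in resnet.parameters():
    p.requires_grad = False

# Trainable classification head (cat vs. dog).
head = nn.Linear(512, 2)

model = nn.Sequential(autoencoder, resnet, head)

x = torch.randn(4, 3, 224, 224)
y = torch.randint(0, 2, (4,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()

# Gradients reach the autoencoder even though the ResNet in between is frozen.
print(autoencoder[0].weight.grad is not None)   # True
print(resnet.conv1.weight.grad is None)         # True: frozen, no weight gradient
print(head.weight.grad is not None)             # True
```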

A related follow-up question: I have seen situations where the input variable/tensor requires a gradient while the weights/biases don’t (they are frozen). Is this what allows differentiation through frozen layers to work?

Thanks!!!

Autograd will make sure to continue the backpropagation to the first parameter that needs gradients, even if frozen layers sit between other trainable layers.
You could set the requires_grad attribute to True for the input, but that would force Autograd to backpropagate to the input, whether or not some layers are frozen.
If you don’t need the gradient in your input tensor, I would recommend not setting requires_grad=True for it.
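
A quick way to verify this with a toy stack of three linear layers, where the middle one is frozen:

```python
import torch
import torch.nn as nn

first = nn.Linear(8, 8)     # trainable
middle = nn.Linear(8, 8)    # frozen
last = nn.Linear(8, 1)      # trainable
for p in middle.parameters():
    p.requires_grad = False

x = torch.randn(4, 8)       # requires_grad is False by default
out = last(middle(first(x)))
out.mean().backward()

print(first.weight.grad is not None)   # True: gradient flowed through the frozen layer
print(middle.weight.grad is None)      # True: no weight gradient for the frozen layer
print(x.grad is None)                  # True: no input gradient unless you ask for it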

In this case, is the backpropagation through the frozen layers the same as through the trainable layers? Or does autograd skip the derivative with respect to the frozen weights (computing it only with respect to the activations), in which case the computational load for the frozen layers would be roughly half that of the trainable layers?

In my case, I have a Transformer encoder (RoBERTa) with a trainable embedding layer, and everything after it is frozen. I want to know whether the backward pass is the same as if everything were trainable, given that the embeddings need to be updated, or whether significant computation is saved on the backward pass since all the transformer layers (everything except the embeddings) are frozen. For reference, this is roughly how I set up the freezing (a sketch assuming Hugging Face's RobertaModel; details may differ for other implementations):
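
```python
import torch
from transformers import RobertaModel, RobertaTokenizer

model = RobertaModel.from_pretrained("roberta-base")

# Freeze everything, then unfreeze only the embedding layer.
for p in model.parameters():
    p.requires_grad = False
for p in model.embeddings.parameters():
    p.requires_grad = True

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
inputs = tokenizer("a cat and a dog", return_tensors="pt")
out = model(**inputs).last_hidden_state
out.mean().backward()

# Only the embeddings accumulate gradients; the frozen encoder layers do not.
print(model.embeddings.word_embeddings.weight.grad is not None)          # True
print(model.encoder.layer[0].attention.self.query.weight.grad is None)   # True (frozen)
```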

This is the case, as seen in this post, where the wgrad kernels are missing if I freeze an intermediate layer, while the dgrad kernels are still visible.
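
If you want to check this on your own model, here is a rough sketch using torch.profiler on a small conv stack with a frozen middle layer (it assumes a CUDA device, and the exact kernel names depend on the backend/cuDNN version, so treat "wgrad"/"dgrad" as examples rather than exact strings):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.Conv2d(8, 8, 3), nn.Conv2d(8, 8, 3)
).cuda()
for p in model[1].parameters():   # freeze the middle conv
    p.requires_grad = False

x = torch.randn(2, 3, 64, 64, device="cuda")

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
) as prof:
    model(x).sum().backward()

# In the table, the frozen conv should be missing its weight-gradient (wgrad)
# kernel, while the data-gradient (dgrad) kernel, which is needed to keep
# backpropagating to the first conv's parameters, is still launched.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=30))
```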