I’m using a pretrained network, built some layers on top of it, and am fine-tuning the whole model. I wonder if I can divide the gradients of the pretrained subnetwork only (the lower layers) by some factor to avoid catastrophic forgetting as soon as training starts.
Yes, you could scale the gradients after calling loss.backward() and before the optimizer.step() call by accessing the .grad attributes of the desired parameters (the gradient clipping utilities, e.g. torch.nn.utils.clip_grad_norm_, might also be interesting for your use case).
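A minimal sketch of this approach (the model, the "backbone"/"head" split, and the scale factor are made-up stand-ins for illustration):

```python
import torch
import torch.nn as nn

# Stand-in model: pretend the first layer is the pretrained "lower" part
# and the second layer is the newly added head.
model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))
backbone, head = model[0], model[1]

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scale = 0.1  # dampen updates to the pretrained layers

x = torch.randn(4, 10)
loss = model(x).sum()
loss.backward()

pre_scale = backbone.weight.grad.clone()  # kept only to verify the scaling

# Scale only the backbone gradients before stepping.
with torch.no_grad():
    for p in backbone.parameters():
        if p.grad is not None:
            p.grad.mul_(scale)

optimizer.step()
```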
Alternatively, using different learning rates (a smaller one for the “lower layers”) could also work, either via separate optimizers or via per-parameter-group options in a single optimizer.
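A sketch of the single-optimizer variant using parameter groups (the layer split and learning-rate values are assumptions, not recommendations):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))

# One optimizer, two parameter groups with different learning rates.
optimizer = torch.optim.SGD([
    {"params": model[0].parameters(), "lr": 1e-4},  # pretrained lower layers
    {"params": model[1].parameters(), "lr": 1e-2},  # newly added head
])
```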