Let’s say I have 6 transformer layers (L0, L1, L2, L3, L4, L5), all with requires_grad = True. I do a forward pass through all the layers. Based on the representations, I may not want to compute gradients for some of the layers (to save some time), say L0, L1 and L2, i.e., I want to stop gradient computation at L2.
To do this, I initially have requires_grad = True for all layers. After the forward pass, if I don’t want gradients for L0, L1 and L2, I set requires_grad = False on those layers. Even after doing this, the program takes the same time to run (I have checked the gradients: they are None, so they are not computed). Can anyone help me out? What am I doing wrong here, or is there another way to do this?
Note: I also set requires_grad = False on the word embedding layer.
Code:
for param in student_model.distilbert.embeddings.parameters():
    param.requires_grad = False
for param in student_model.distilbert.transformer.layer[:curr_flag].parameters():
    param.requires_grad = False
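For reference, here is a minimal self-contained sketch of the effect I’m trying to get (backward should stop at the boundary layer), using plain nn.Linear layers as a hypothetical stand-in for the transformer blocks. Running the frozen layers under torch.no_grad() means no graph is built for them, so backward never touches them:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a 6-layer transformer stack (L0..L5).
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])
curr_flag = 3            # freeze L0, L1, L2
x = torch.randn(2, 8)

h = x
with torch.no_grad():    # no autograd graph is built for L0-L2
    for layer in layers[:curr_flag]:
        h = torch.relu(layer(h))

for layer in layers[curr_flag:]:   # graph is built only for L3-L5
    h = torch.relu(layer(h))

h.sum().backward()

print(layers[0].weight.grad)                 # None: frozen layer, no gradient
print(layers[curr_flag].weight.grad is not None)  # True: trainable layer got a gradient
```

The catch in my real setting is that I only know curr_flag after looking at the representations, i.e., after the forward pass has already built the graph, so I’m not sure this pattern applies directly.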