I’d like to train only a few rows of a linear layer (the linear layer is a classification head on top of BERT for multi-label classification, and I’d like to keep training the head for just 10 of the labels). I tried the following:
model.classifier.weight[other_labels_indices, :].requires_grad = False
model.classifier.bias[other_labels_indices].requires_grad = False
In other words, I’d like to freeze the parameters of the classifier for the “other labels”, which already have good performance, and only train the parameters corresponding to the labels that are currently underperforming.
However, this gave me the following error:
RuntimeError: you can only change requires_grad flags of leaf variables. If you want to use a computed variable in a subgraph that doesn't require differentiation use var_no_grad = var.detach().
I then tried a workaround, but it seems that these parameters are still being updated. Am I doing something wrong?
I am afraid you won’t be able to do that.
The requires_grad flag applies to a whole Tensor; it cannot be set for a subset of its rows or elements.
You have a few approaches here:
- Compute the gradient for the full layer, then zero out the gradient rows for the frozen labels before calling the optimizer update.
- If your optimizer still updates a value when its gradient is 0 (for example because of weight decay or momentum), it is simpler to store the original values of the rows you want to keep, do the full gradient computation and optimizer update, and then write back the rows you didn’t want changed.
- The last option is to keep a separate Tensor (which will be your Parameter) containing only the rows you want learned, and at every forward pass write it into a bigger Tensor that also contains the frozen entries, then train with that.
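A minimal sketch of the first approach (the layer sizes, frozen-label indices, and loss below are made up purely for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes and indices, just for illustration.
num_labels, hidden_size = 20, 768
frozen_rows = torch.tensor([0, 1, 2])  # labels whose head weights should stay fixed

classifier = nn.Linear(hidden_size, num_labels)
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)

orig_weight = classifier.weight.detach().clone()

x = torch.randn(4, hidden_size)
target = torch.randint(0, 2, (4, num_labels)).float()

loss = nn.functional.binary_cross_entropy_with_logits(classifier(x), target)
loss.backward()

# Zero out the gradient rows of the frozen labels before the optimizer step.
classifier.weight.grad[frozen_rows] = 0.0
classifier.bias.grad[frozen_rows] = 0.0

optimizer.step()
optimizer.zero_grad()

# With plain SGD (no momentum, no weight decay) the frozen rows are untouched.
assert torch.allclose(classifier.weight[frozen_rows], orig_weight[frozen_rows])
```

Note the caveat in the second bullet: an optimizer with weight decay or momentum can still move a parameter whose gradient is zero, which is when writing the frozen rows back after the step becomes necessary.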
Thanks! I tried the first approach and it seems to work.
Hi, I’m in a similar situation. These approaches work well, but they seem to be very slow.
My solution was to program the gradient descent manually, but that felt limited, and I wanted to use the built-in optimizers so I could switch to more advanced gradient descent algorithms.
Using the optimizer and zeroing the gradients of all the parameters I don’t want to update takes twice as long as before, and from some tests I ran, the slow part seems to be the zeroing of the gradients itself.
Actually, I was able to solve this. Instead of zeroing the gradients one by one in a for loop, I now do it with a single matrix multiplication on the GPU, which is much faster.
Now I basically do
weights.grad = torch.mm(matrix, weights.grad)
where matrix is mostly 0s, with 1s at specific positions so that the gradients of the parameters I want to update are kept.
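A small sketch of that masking, with made-up sizes and indices: left-multiplying the gradient by a 0/1 diagonal selector keeps the rows of the labels still being trained and zeros the rest.

```python
import torch

num_labels, hidden_size = 20, 768
trainable_rows = torch.tensor([3, 7])  # hypothetical labels still being trained

# Diagonal selector: (mask @ grad)[i] equals grad[i] if i is trainable, else 0.
mask = torch.zeros(num_labels, num_labels)
mask[trainable_rows, trainable_rows] = 1.0

grad = torch.randn(num_labels, hidden_size)
masked = torch.mm(mask, grad)

assert torch.allclose(masked[trainable_rows], grad[trainable_rows])
assert torch.all(masked[0] == 0) and torch.all(masked[19] == 0)
```

An equivalent and cheaper alternative is an elementwise row mask, e.g. `grad * row_mask.unsqueeze(1)` with a 0/1 vector of length num_labels, which avoids the full n×n matrix product.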