Hi there!
I am fairly new to PyTorch and I am trying to give the parameters from BERT a different learning rate, while the rest of the model’s parameters share the same lr.
My model class looks like this (it’s from a tutorial):
class BERTGRUModel(nn.Module):
    def __init__(self,
                 bert,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout):
        super().__init__()
        self.bert = bert
        embedding_dim = bert.config.to_dict()['hidden_size']
        self.rnn = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers=n_layers,
                          bidirectional=bidirectional,
                          batch_first=True,
                          dropout=0 if n_layers < 2 else dropout)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
And I am trying to set the different learning rates like this:
optimizer = optim.Adam([
    {'params': model.rnn.parameters(), 'lr': 0.001},
    {'params': model.out.parameters(), 'lr': 0.001},
    {'params': model.dropout.parameters(), 'lr': 0.001},
    {'params': model.bert.parameters(), 'lr': 1e-5},
])
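For reference, here is a minimal self-contained sketch of the parameter-group pattern I am using, with a toy stand-in model instead of the real BERT/GRU stack (the `ToyModel` class and its layers are just placeholders I made up to keep it runnable). It splits parameters by name so nothing gets missed, and since `nn.Dropout` has no parameters it needs no group of its own:

```python
import torch.nn as nn
import torch.optim as optim

# Toy stand-in for the model above: a "bert" submodule plus the rest.
# nn.Linear/nn.GRU here are only placeholders with trainable parameters;
# the grouping logic is what matters, not the layers themselves.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = nn.Linear(4, 4)              # stands in for the pretrained BERT
        self.rnn = nn.GRU(4, 4, batch_first=True)
        self.out = nn.Linear(4, 2)

model = ToyModel()

# Split parameters by name instead of listing every submodule by hand.
bert_params = [p for n, p in model.named_parameters() if n.startswith('bert.')]
other_params = [p for n, p in model.named_parameters() if not n.startswith('bert.')]

optimizer = optim.Adam([
    {'params': bert_params, 'lr': 1e-5},  # small lr for the pretrained encoder
    {'params': other_params},             # falls back to the default lr below
], lr=1e-3)

print([g['lr'] for g in optimizer.param_groups])  # → [1e-05, 0.001]
```

A group without its own 'lr' key inherits the default passed to the optimizer, which is how Adam’s usual 1e-3 applies to everything except the BERT group.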
But this performs worse than simply using:
optimizer = optim.Adam(model.parameters())
It gives worse accuracy, and the precision and recall drop to 0 (for each batch).
I’m sure I am doing something wrong; I just can’t figure out what.