I would like to fine-tune the models in the torchvision model zoo on my own dataset. I need to set different learning rates for the original layers and the modified classifier layer. I modified the ResNet model like this:
model = torchvision.models.resnet101(pretrained = False)
model.fc = nn.Linear(in_features = 2048, out_features = 10)
how could I design the optimizer?
This doesn't seem to work:
optimizer = torch.optim.SGD(
{'params': model.parameters()[:-1], 'lr': 1e-4, 'momentum': 0.9, 'weight_decay': 1e-4},
{'params': model.parameters()[-1], 'lr': 5e-3, 'momentum': 0.9, 'weight_decay': 1e-4},)
I think there are some minor errors (a missing list bracket, and a slice op applied to a generator).
This should work:
optimizer = torch.optim.SGD([
{'params': list(model.parameters())[:-1], 'lr': 1e-4, 'momentum': 0.9, 'weight_decay': 1e-4},
{'params': list(model.parameters())[-1], 'lr': 5e-3, 'momentum': 0.9, 'weight_decay': 1e-4}
])
Got it, but consider a trickier case with ResNet-101:
ResNet(
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential( ... )
(layer2): Sequential( ... )
(layer3): Sequential( ... )
(layer4): Sequential( ... )
(avgpool): AvgPool2d(kernel_size=7, stride=1, padding=0)
  (fc): Linear(in_features=2048, out_features=10, bias=True)
)
If I need to fine-tune the parameters of layer2 with lr = 1e-3 while fine-tuning all the other parameters with lr = 1e-4, how should I write the optimizer?
In this case, I would set the learning rate for layer2
and use the default for all others as shown in the docs.
I've created a small example of how to filter out specific layers.
Thank you so much, that is very helpful!
For the resnet model, I believe this method might not be sufficient. The final Linear layer has two parameters, the weight and the bias. By using model.parameters()[-1], you are applying the different LR only to the bias term of the final Linear layer, while its weight stays in the low-LR group.