Implementing differential learning rate by parameter groups


I am trying to implement different learning rates across my network.
I am creating the parameter groups as follows:

Simple optimizer:

optimizer = optim.SGD(net.parameters(), lr=learning_rate )

Optimizer with parameter groups:

optimizer1 = optim.SGD([
    {'params': net.top_model[0:10].parameters(), 'lr': learning_rate/10, 'momentum': 0},
    {'params': net.top_model[10:31].parameters(), 'lr': learning_rate/3},
    {'params': net.linear1.parameters()},
    {'params': net.bn1.parameters()},  
], lr=learning_rate )
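For reference, here is a self-contained sketch of the same pattern on a small stand-in model (the layers here are hypothetical placeholders, not the VGG-based Net below), printing each group's learning rate and parameter-tensor count:

```python
import torch.nn as nn
import torch.optim as optim

learning_rate = 0.01

# Stand-in model: two conv layers and a linear layer instead of the full Net.
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.Conv2d(8, 8, 3), nn.Linear(8, 2))

# Same param-group pattern as above: per-group 'lr' overrides the default lr.
optimizer1 = optim.SGD([
    {'params': net[0].parameters(), 'lr': learning_rate / 10, 'momentum': 0},
    {'params': net[1].parameters(), 'lr': learning_rate / 3},
    {'params': net[2].parameters()},  # falls back to the default lr below
], lr=learning_rate)

for i, group in enumerate(optimizer1.param_groups):
    # each conv/linear layer contributes 2 tensors (weight + bias)
    print(i, group['lr'], len(group['params']))
```

Each group is a dict in `optimizer1.param_groups`; keys not set in a group (like `lr` in the third one) are filled from the defaults passed to the constructor.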

When I do

len(optimizer.param_groups[0]['params']) # I get 30
len(optimizer1.param_groups[0]['params'])  # I get 8

I don’t understand how PyTorch arrives at these numbers.
Could someone clarify whether this is the right way to do it, and explain the difference in the param_group counts?

Thanks a ton!

My network:

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        vgg = models.vgg16(pretrained=True)
        layers = list(vgg.children())[0][:31]
        self.top_model = nn.Sequential(*layers).cuda()
        self.bn1 = nn.BatchNorm1d(512)
        self.linear1 = nn.Linear(512,10)

    def forward(self,x):
        x = F.relu(self.top_model(x))
        x = nn.AdaptiveAvgPool2d((1,1))(x)
        x = x.view(x.shape[0],-1)
        x = self.bn1(x)
        x = self.linear1(x)
        return x

The first 10 layers of VGG-16 contain 4 conv layers, each with a weight and a bias, so 8 parameter tensors - those make up the first param group when you split. The remaining slice has 9 more conv layers, so 18 tensors, and bn1 and linear1 contribute 2 each (weight + bias). With a single parameter group (i.e. no split), you see all 8 + 18 + 2 + 2 = 30.
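You can check this count yourself. The sketch below rebuilds the VGG-16 feature stack from its layer configuration (so no pretrained weights are downloaded; the `cfg` list is the standard VGG-16 "D" config) and counts the parameter tensors in each slice:

```python
import torch.nn as nn

# VGG-16 feature config: numbers are conv output channels, 'M' is max-pool.
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']

layers, in_ch = [], 3
for v in cfg:
    if v == 'M':
        layers.append(nn.MaxPool2d(2))
    else:
        # each conv contributes 2 parameter tensors (weight + bias)
        layers += [nn.Conv2d(in_ch, v, 3, padding=1), nn.ReLU(inplace=True)]
        in_ch = v

top_model = nn.Sequential(*layers[:31])  # same slice as in Net above

print(len(list(top_model[0:10].parameters())))   # 4 convs -> 8 tensors
print(len(list(top_model[10:31].parameters())))  # 9 convs -> 18 tensors
```

The slice `[0:10]` contains the convs at indices 0, 2, 5, 7 (the rest are ReLU/pool layers with no parameters), which is where the 8 in the first param group comes from.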

Best regards


Thank you so much for the clarification @tom !