Can I split the network and tune different layers with different learning rates?

I am trying to update the feature extractor and the classifier with different learning rates. To do this, I split the network into two parts and pass them to the optimizer as separate parameter groups.

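from collections import OrderedDict

import torch
import torch.nn as nn
from torchvision import models
from tqdm import trange

# Pretrained backbone (1000-class output) plus a new 31-class layer on top.
resnet50 = models.resnet50(pretrained=True)
classifier = nn.Sequential(OrderedDict([("classifier", nn.Linear(1000, 31))]))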

resnet50 = nn.Sequential(resnet50, classifier)
featExtractor = nn.Sequential(*(list(resnet50.children())[:-1])).cuda()
classifierModel = nn.Sequential(*(list(resnet50.children())[-1:])).cuda()

clf_optim = torch.optim.Adam([{'params': featExtractor.parameters(), 'lr':1e-4},
                             {'params': classifierModel.parameters()}], lr=5e-4)

for epoch in trange(epochs, leave=False):

    for _ in trange(iterations, leave=False):
        source_x, source_y = next(iter(amazonData))
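        # Assumed: the right-hand side is missing in the post; moving the batch
        # to the GPU matches the .cuda() calls on the models above.
        source_x, source_y = source_x.cuda(), source_y.cuda()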

        for _ in range(k_clf):
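            features = featExtractor(source_x)
            out = classifierModel(features)
            clf_loss = clf_criterion(out, source_y)

            # Backpropagate and update; each parameter group uses its own lr.
            clf_optim.zero_grad()
            clf_loss.backward()
            clf_optim.step()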

    print("total_loss: ", clf_loss)

PS: I know I could have just used resnet50 instead of featExtractor and classifier instead of classifierModel, but this is just the shortest version of what I am doing; I am mainly looking to validate the idea.

@ptrblck what do you think?

The code looks basically alright, and the per-parameter option should work, too.
However, I’m not sure if it’s the best idea to just add another custom classifier on top of the pre-trained model. Usually you would remove the last linear layer and replace it with a new one.

PS: I’m not a big fan of tagging certain people as this might demotivate others to write an answer. :wink:

I am sorry, I will definitely avoid tagging.

Yes, I understand that we usually remove the pretrained fc layer. But I don’t want to go directly from 2048 -> 31.

So, do you mean replacing the last pretrained fc layer and adding 2 custom fc layers, like 2048-1000-31?
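To make the 2048-1000-31 idea concrete, here is a rough sketch of such a head. The ReLU between the two layers is my assumption and is not specified anywhere in this thread; without some non-linearity, two stacked linear layers are equivalent to a single linear map. The random features tensor is only a stand-in for the pooled backbone output.

```python
import torch
import torch.nn as nn

# Two-layer head: 2048 -> 1000 -> 31.
head = nn.Sequential(
    nn.Linear(2048, 1000),
    nn.ReLU(),  # assumption: the activation is not specified in the thread
    nn.Linear(1000, 31),
)

# With a torchvision resnet50 this would replace the final layer:
#   model.fc = head

features = torch.randn(4, 2048)  # stand-in for pooled backbone features
logits = head(features)          # one row of 31 class scores per sample
```

Since the whole head is newly initialized, its parameters could go into the higher-learning-rate group via head.parameters().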

Yeah, that’s a good point and I’m really not sure which would work the best.
Just based on my gut feeling, I would assume both described alternatives might work better than going from the pre-trained 1000 class output to your custom layer.

Could you post your results in case you are trying some different approaches as this is quite interesting. :slight_smile:

Sure, I will definitely post it once I am done with my training. I am working on it.
