Does feature extraction have an effect on the learning rate?

Hi all,

When I’m loading my pretrained model without freezing or removing any layers and using an lr of 0.001, I’m getting quite good results. Ex:


def get_trainable(model_params):
    return (p for p in model_params if p.requires_grad)

optimizer = torch.optim.Adam(
    get_trainable(model.parameters()),
    lr=0.001,
)

However, when I do freeze layers for feature extraction, my model doesn’t learn at all. Ex:

def set_parameter_requires_grad(model, feature_extracting):
    if feature_extracting:
        for param in model.parameters():
            param.requires_grad = False

n_classes = 5
set_parameter_requires_grad(model, True)           # freeze the whole backbone
num_ftrs = model.classifier.in_features
model.classifier = nn.Linear(num_ftrs, n_classes)  # new layer is trainable by default
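A quick way to check that the freeze worked is to count trainable parameters afterwards. A minimal sketch with a toy stand-in model (ToyNet and its layer sizes are made up; the real model would be the pretrained network):

```python
import torch.nn as nn

# Toy stand-in for the pretrained network (assumption: the real model is a
# torchvision net with a `.classifier` head, e.g. a DenseNet).
class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(10, 8)      # pretend "backbone"
        self.classifier = nn.Linear(8, 1000)  # pretend ImageNet head

def set_parameter_requires_grad(model, feature_extracting):
    if feature_extracting:
        for param in model.parameters():
            param.requires_grad = False

model = ToyNet()
set_parameter_requires_grad(model, True)                       # freeze everything
model.classifier = nn.Linear(model.classifier.in_features, 5)  # fresh head, trainable

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)  # 8 * 5 weights + 5 biases = 45
```

If that number comes out as the full model size, the new head was probably created before the freeze and got frozen along with the backbone.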

There’s a third scenario I just ran into where I don’t freeze anything but specify the number of classes. In this case, my model overfits.

model.classifier = nn.Linear(num_ftrs, n_classes)

Any idea why this happens?

Hi, are you passing the whole model to the optimizer in the second case?
You have to — and the optimizer has to be created after you replace the classifier, otherwise the new layer’s parameters aren’t in it.
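One way to make sure exactly the trainable parameters reach the optimizer is to filter on requires_grad. A sketch with a toy two-layer model standing in for the real network (the layer sizes are made up):

```python
import torch
import torch.nn as nn

# Toy frozen "backbone" + fresh head (assumption: stands in for the real model).
model = nn.Sequential(nn.Linear(10, 8), nn.Linear(8, 5))
for param in model[0].parameters():
    param.requires_grad = False  # frozen "pretrained" part

# Filter out frozen parameters so the optimizer only sees the new head.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.001,
)
print(len(optimizer.param_groups[0]["params"]))  # 2: the head's weight and bias
```

Passing model.parameters() directly also works in recent PyTorch versions, since parameters that never receive a gradient are skipped during the step; filtering just makes the intent explicit.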

What’s the difference between the 3rd situation and the 2nd? You seem to be specifying the number of classes in both.

The difference between 2nd and 3rd is that for the 2nd scenario I’m using the set_parameter_requires_grad function.

Regarding optimization, I’m using this for both:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

It’s still a bit confusing for me. How are you training in the 1st case if you aren’t adding a last layer that maps to the number of classes in your dataset?

It seems you are loading the network without modifying the final number of classes. If you are doing feature extraction, this is fine.

Without modifications you are taking the ImageNet weights. Those weights come from a very big dataset with 1000 classes. That’s why the network generalizes very well as a feature extractor.

When you freeze the whole model except the final layer, your network is not able to map those features to your classes.
In the 3rd case, it seems your dataset is not big enough and you overfit.

How to solve this?
You can think of the first layers as more general: the deeper you go, the more specific the learned filters are. You can try a smaller LR for the pretrained part, so those weights aren’t modified too much, and a normal learning rate for the last FC layer you are adding. You can even freeze the first layers, as those are truly very general.
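The two learning rates can be set with optimizer parameter groups. A sketch, again with a made-up toy model in place of the pretrained one (the 1e-4 / 1e-3 values are only illustrative):

```python
import torch
import torch.nn as nn

# Toy model standing in for pretrained backbone + new FC head (assumption).
class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(10, 8)   # "pretrained" part
        self.classifier = nn.Linear(8, 5)  # new head

model = ToyNet()

# Smaller LR for the pretrained part, normal LR for the new head.
optimizer = torch.optim.Adam([
    {"params": model.features.parameters(), "lr": 1e-4},
    {"params": model.classifier.parameters(), "lr": 1e-3},
])
print([g["lr"] for g in optimizer.param_groups])  # [0.0001, 0.001]
```

Each dict becomes its own param group, so the backbone moves slowly while the head trains at the usual rate.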

Maybe this will help make things clearer:

Scenario 1
Trainable params: 11M
Output layer: 1000 classes
Outcome: Model learns, doesn’t overfit.

Scenario 2
Trainable params: 5k
Output layer: 5 classes
Outcome: Model learns, overfits.

Scenario 3
Trainable params: 11M
Output layer: 5 classes
Outcome: Model doesn’t learn.

I don’t mind the trainable params; the question is why the output layer influences the learning.