Which parameters to pass in optimizer for transfer learning?

Hello,

I am working on multiclass image classification with a custom dataset of 13 classes: Alien, Predator, Terminator, Robin, Batman, Superman, Spiderman, Valkyrie, Raven, BeastBoy, DeathStroke, Deadpool, and PoisonIvy. I have around 5236 images for training and 1300 for validation, i.e. roughly 400 training and 100 validation images per class. I went through the Transfer Learning for Computer Vision Tutorial — PyTorch Tutorials 2.0.0+cu117 documentation and PyTorch's fine-tuning tutorial. I am using a pretrained ConvNeXt model, and I have unfrozen layers 6 and 7 of the feature extractor and layer 2 of the classifier:

(7): Sequential(
      (0): CNBlock(
        (block): Sequential(
          (0): Conv2d(768, 768, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3), groups=768)
          (1): Permute()
          (2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (3): Linear(in_features=768, out_features=3072, bias=True)
          (4): GELU(approximate='none')
          (5): Linear(in_features=3072, out_features=768, bias=True)
          (6): Permute()
        )
        (stochastic_depth): StochasticDepth(p=0.37714285714285717, mode=row)
      )
      (1): CNBlock(
        (block): Sequential(
          (0): Conv2d(768, 768, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3), groups=768)
          (1): Permute()
          (2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (3): Linear(in_features=768, out_features=3072, bias=True)
          (4): GELU(approximate='none')
          (5): Linear(in_features=3072, out_features=768, bias=True)
          (6): Permute()
        )
        (stochastic_depth): StochasticDepth(p=0.3885714285714286, mode=row)
      )
      (2): CNBlock(
        (block): Sequential(
          (0): Conv2d(768, 768, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3), groups=768)
          (1): Permute()
          (2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (3): Linear(in_features=768, out_features=3072, bias=True)
          (4): GELU(approximate='none')
          (5): Linear(in_features=3072, out_features=768, bias=True)
          (6): Permute()
        )
        (stochastic_depth): StochasticDepth(p=0.4, mode=row)
      )
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=1)
  (classifier): Sequential(
    (0): LayerNorm2d((768,), eps=1e-06, elementwise_affine=True)
    (1): Flatten(start_dim=1, end_dim=-1)
    (2): Linear(in_features=768, out_features=13, bias=True)
  )
)
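For reference, the final classifier layer shown above outputs 13 classes instead of the pretrained 1000 ImageNet classes; a minimal sketch of that replacement step (assuming torchvision's convnext_small):

import torch.nn as nn
from torchvision import models

model = models.convnext_small(pretrained=True)

# Replace the pretrained 1000-class head with a 13-class head for the custom dataset.
num_classes = 13
in_features = model.classifier[2].in_features  # 768 for convnext_small
model.classifier[2] = nn.Linear(in_features, num_classes)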

Here is the model summary:

 └─Sequential (6)                                   [12, 384, 14, 14]    [12, 768, 7, 7]      --                   True
│    │    └─LayerNorm2d (0)                             [12, 384, 14, 14]    [12, 384, 14, 14]    768                  True
│    │    └─Conv2d (1)                                  [12, 384, 14, 14]    [12, 768, 7, 7]      1,180,416            True
│    └─Sequential (7)                                   [12, 768, 7, 7]      [12, 768, 7, 7]      --                   True
│    │    └─CNBlock (0)                                 [12, 768, 7, 7]      [12, 768, 7, 7]      4,763,136            True
│    │    └─CNBlock (1)                                 [12, 768, 7, 7]      [12, 768, 7, 7]      4,763,136            True
│    │    └─CNBlock (2)                                 [12, 768, 7, 7]      [12, 768, 7, 7]      4,763,136            True
├─AdaptiveAvgPool2d (avgpool)                           [12, 768, 7, 7]      [12, 768, 1, 1]      --                   --
├─Sequential (classifier)                               [12, 768, 1, 1]      [12, 13]             --                   True
│    └─LayerNorm2d (0)                                  [12, 768, 1, 1]      [12, 768, 1, 1]      1,536                True
│    └─Flatten (1)                                      [12, 768, 1, 1]      [12, 768]            --                   --
│    └─Linear (2)                                       [12, 768]            [12, 13]             9,997                True
=======================================================================================================================================
Total params: 49,464,685
Trainable params: 15,482,125
Non-trainable params: 33,982,560
Total mult-adds (G): 4.93
=======================================================================================================================================
Input size (MB): 7.23
Forward/backward pass size (MB): 2485.59
Params size (MB): 197.80
Estimated Total Size (MB): 2690.62
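The summary above appears to have been produced with torchinfo; if so, a call along these lines reproduces it (the 224x224 input resolution and batch size 12 are assumptions based on the shapes shown):

from torchinfo import summary

# Sketch: per-layer summary with input/output shapes, parameter counts, and trainable flags.
summary(
    model,
    input_size=(12, 3, 224, 224),
    col_names=["input_size", "output_size", "num_params", "trainable"],
)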

Why unfreeze CNN feature extractor layers?
Because the custom dataset is completely new to the pretrained model (ConvNeXt); ConvNeXt has never seen such data. So, from my understanding, it is better to unfreeze the last two layers of the feature extractor so they can learn features specific to my custom data.

I have a couple of questions:

  1. Which parameters should I pass to the optimizer? Should it be the whole model's parameters, or only the parameters of the layers I have unfrozen (layers 6 and 7 of the feature extractor plus the classifier layers)?

Approach 1: update only the weights of the unfrozen layers (layers 6 and 7 of the feature extractor plus the classifier) while the rest of the model's weights stay frozen.

import torch.optim as optim
from torchvision import models

model = models.convnext_small(pretrained=True)

# Freeze everything first, then unfreeze only the layers to fine-tune.
for param in model.parameters():
    param.requires_grad = False

finetune_features_layers = ["6", "7"]  # last two blocks of the feature extractor
params_to_update = []

for param in model.classifier.parameters():
    param.requires_grad = True
    params_to_update.append(param)

for name, block in model.features.named_children():
    if name in finetune_features_layers:
        for param in block.parameters():
            param.requires_grad = True
            params_to_update.append(param)

# optimizer only sees the unfrozen parameters
optimizer = optim.Adam(
    params_to_update,
    lr=0.0001
)
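As a sanity check, the number of parameters collected in params_to_update should match the trainable count reported in the summary above; a small sketch:

# Count the parameters that will actually be optimized.
n_trainable = sum(p.numel() for p in params_to_update)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable: {n_trainable:,} / total: {n_total:,}")
# Should roughly match the summary above: 15,482,125 trainable of 49,464,685.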

Approach 2: update the whole model's weights. Note that in this case the parameters of all layers will be optimized.

import torch.optim as optim
from torchvision import models

model = models.convnext_small(pretrained=True)
params_to_update = model.parameters()
# optimizer sees every parameter of the model
optimizer = optim.Adam(
    params_to_update,
    lr=0.0001
)
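To see the difference concretely: parameters with requires_grad=False never receive a gradient, and optimizers skip parameters whose .grad is None, so passing all parameters is harmless as long as the frozen ones really have requires_grad=False. Approach 2 as written leaves every parameter trainable. A small sketch with hypothetical tensors (not the ConvNeXt model):

import torch
import torch.optim as optim

frozen = torch.nn.Parameter(torch.ones(3), requires_grad=False)
trainable = torch.nn.Parameter(torch.ones(3), requires_grad=True)

opt = optim.Adam([frozen, trainable], lr=0.1)
loss = (trainable * 2.0).sum() + (frozen * 2.0).sum()
loss.backward()
opt.step()

print(frozen)     # unchanged: no gradient was ever computed for it
print(trainable)  # updated by Adam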
  2. Does the size and type of the dataset matter for generalization, which could essentially change how many CNN layers should be kept frozen or unfrozen?
  3. I trained my model and got the following results:
Epoch: 49 
Train Loss: 1.156555 Acc: 0.6353
Elapsed 12323.04s, 246.46 s/epoch, 3.01 s/batch, ets 0.00s

Test set: Average loss: 0.8826, Accuracy: 930/1300 (72%)

Model Improved. Saving the Model...

But when I evaluate this newly trained model on test data (completely new, unseen custom data with the same 13 classes as above), I only get around 10% accuracy. I am trying to understand what is going wrong here. I checked both the training and validation datasets, and they have the right class labels and images.

Is my understanding of feature extraction and fine-tuning correct? Am I heading in the right direction?

  1. I would prefer the explicit approach of passing only the trainable parameters to the optimizer, assuming you don’t want to “unfreeze” other parameters later in the training.

  2. Yes, the dataset size and “type” will certainly matter during fine-tuning, in a similar way they would matter when training a model from scratch.

  3. Could you reuse your training data during this testing step to make sure you are still able to achieve the previously reported accuracy and loss? A minimal sketch for that check is shown below.
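Something along these lines could be used for that check (a sketch only; train_dataset, device, and the batch size are placeholders, and train_dataset should use the same non-augmenting transforms as validation):

import torch
from torch.utils.data import DataLoader

# Re-evaluate on the training images using the evaluation pipeline.
loader = DataLoader(train_dataset, batch_size=32, shuffle=False)

model.eval()
correct = total = 0
with torch.no_grad():
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == targets).sum().item()
        total += targets.size(0)

print(f"accuracy on training data: {correct / total:.2%}")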

Actually, when training my model for a longer time (around 70 epochs), the training accuracy reaches 95% and the validation accuracy reaches up to 79% at most. However, I got only around 20% accuracy on the training set when I ran model evaluation.