This is more of an accuracy/algorithm question, so I would appreciate some empirical/theoretical advice.
Hi, I’m trying to fine-tune a ViT model on CIFAR100 (100 classes) starting from a pretrained checkpoint I got from the Hugging Face model hub. The checkpoint performs well on the ImageNet validation set (80+% accuracy).
Because the pretrained weights were trained on ImageNet (1k classes), I had to re-initialize the final classifier layer and fine-tune it for the 100 CIFAR100 classes. Initially I thought that fine-tuning only the re-initialized classifier layer while freezing the rest would save a lot of training time. I went with this strategy and it worked: I reached 80% test accuracy in 20 epochs, and the accuracy did not really improve after that.
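For reference, here is roughly what the first stage looked like (a minimal sketch only: I'm assuming a checkpoint like `google/vit-base-patch16-224` and the `ViTForImageClassification` class from `transformers`; the optimizer and learning rate shown are placeholders, not my exact settings):

```python
import torch
from transformers import ViTForImageClassification

# Load the ImageNet-pretrained ViT but swap in a fresh 100-way classifier head.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",   # assumed checkpoint, mine may differ
    num_labels=100,                  # CIFAR100
    ignore_mismatched_sizes=True,    # drop the 1000-class head and re-initialize it
)

# Freeze the backbone so only the new classifier head is trained.
for param in model.vit.parameters():
    param.requires_grad = False

# Only the classifier parameters go to the optimizer in this stage.
optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)
```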
Afterwards, I “un-froze” the ViT backbone parameters and fine-tuned further. My expectation was that the model would start from 80% accuracy and improve with more epochs of fine-tuning. However, the accuracy tanked in the first epoch, and the accuracy curve for this stage looked similar to training a ViT from scratch.
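The second stage was essentially just flipping `requires_grad` back on and continuing training (again a sketch continuing from the snippet above; the optimizer settings are placeholders):

```python
# Un-freeze the backbone so every parameter is trainable again.
for param in model.vit.parameters():
    param.requires_grad = True

# Re-create the optimizer over all parameters and keep fine-tuning.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
```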
My question here is: shouldn’t fine-tuning the fully unfrozen model be at least as good as fine-tuning the partially frozen one? Does unfreezing the previously frozen layers for further fine-tuning hurt accuracy rather than help with faster convergence? Should unfreezing the frozen layers be treated as effectively creating a whole new model?