As I understand the concept of captioning, I need the image features rather than class predictions, so if I have a classification model the classification head can be dropped. Does the code below implement that idea?
The modifications are:
def gradient(model, freeze: bool):
    # Freeze (or unfreeze) all parameters of the model that was passed in.
    for parameter in model.parameters():
        parameter.requires_grad_(not freeze)

def vit_small(patch_size=16, **kwargs):
    model = VisionTransformer(
        patch_size=patch_size, embed_dim=384, depth=12, num_heads=6,
        mlp_ratio=4, qkv_bias=True,
        num_classes=0,  # num_classes=0 drops the classification head
        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    gradient(model, freeze=True)
    return model
Is that right?
And after these modifications, do I need to re-train the model on my dataset?
Excuse me, and thanks for replying… If I have a detection model and want to use it for captioning, do I need to freeze any layers of the model, or is freezing only for classification models?
I don’t think there is a strict answer to this, because it depends on the entire model architecture and not just the backbone. So you can try training with or without freezing, though training without freezing will probably give you better results.
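As a minimal sketch of the freezing pattern (using a toy two-part model as a hypothetical stand-in for a real backbone plus captioning head, not any specific architecture):

```python
import torch
from torch import nn

# Toy stand-in modules, just to illustrate freezing only part of a model.
model = nn.Sequential()
model.add_module("backbone", nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()))
model.add_module("head", nn.Linear(8, 16))

# Freeze only the backbone; the head stays trainable.
for p in model.backbone.parameters():
    p.requires_grad_(False)

# Give the optimizer only the parameters that still require gradients.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(trainable, frozen)
```

After re-training (fine-tuning) on your dataset, only the head's weights change; the frozen backbone keeps its pretrained features.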