As I understand image captioning, I need features from the images rather than class predictions. So if I have a classification model, the classification head can be dropped. Does the code here implement that idea?
The modifications are:

    from functools import partial
    import torch.nn as nn

    def gradient(model, freeze: bool):
        for parameter in model.parameters():  # freeze/unfreeze every parameter
            parameter.requires_grad = not freeze

    def vit_small(patch_size=16, **kwargs):
        model = VisionTransformer(patch_size=patch_size, embed_dim=384, depth=12,
                                  num_heads=6, mlp_ratio=4, qkv_bias=True,
                                  num_classes=0,  # num_classes=0 drops the classification head
                                  norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
        gradient(model, freeze=True)
        return model
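One way to sanity-check a freezing helper like `gradient` is to run it on a small stand-in module and inspect `requires_grad` afterwards; this is a minimal sketch using a plain `nn.Linear` in place of the real `VisionTransformer`, whose definition I don't have here:

```python
import torch.nn as nn

def gradient(model, freeze: bool):
    # Same idea as above: toggle requires_grad on every parameter
    for parameter in model.parameters():
        parameter.requires_grad = not freeze

stand_in = nn.Linear(8, 4)  # stand-in for the real VisionTransformer
gradient(stand_in, freeze=True)
print(all(not p.requires_grad for p in stand_in.parameters()))  # True
```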
Is that right?
After the modifications, do I need to re-train the model on my dataset?
What results should I expect?
Usually, if you want to get feature maps from a model, the typical approach is to edit the forward function in the model definition to return the intermediate feature maps in addition to the final output; see e.g. How to extract features of an image from a trained model - #6 by fmassa
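If you'd rather not edit the model definition, a forward hook can grab the intermediate output instead. A minimal sketch, using a tiny `nn.Sequential` as a stand-in for your backbone + classification head:

```python
import torch
import torch.nn as nn

# Stand-in model: pretend "backbone" followed by a classification head
model = nn.Sequential(
    nn.Linear(8, 16),   # "backbone"
    nn.ReLU(),
    nn.Linear(16, 10),  # "classification head"
)

features = {}

def hook(module, inputs, output):
    # Save the intermediate feature map produced by this layer
    features["backbone"] = output.detach()

# Register the hook on the layer whose output you want
handle = model[1].register_forward_hook(hook)

x = torch.randn(4, 8)
logits = model(x)              # normal forward pass
feats = features["backbone"]   # intermediate features, shape (4, 16)
handle.remove()                # clean up when done
```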
Usually some kind of finetuning would be needed (at least), since classification is a somewhat different domain from captioning.
Excuse me, and thanks for replying. If I have a detection model and need to use it for captioning, do I need to freeze any layers of the model, or is freezing only for classification models?
I don’t think there is a strict answer for this, because it depends on the entire model architecture and not just the backbone. You can try training with or without freezing, though training without freezing will probably give you better results.