How to get embeddings from MViT as encoder

I am trying to use MViT as an encoder to get embeddings of an input, which I then pass into another model. I am looking at vision_transformers.py in the facebookresearch/pytorchvideo repository on GitHub and using the create_multiscale_vision_transformers function to build the MViT model. Something like this:

model = create_multiscale_vision_transformers(spatial_size=100, temporal_size=10)

Then, when I pass a tensor to the model like

output = model(tensor)

and check the shape with

output.shape

I get a tensor of dim 400. This looks like the logits for the classification task.

The question is: how do I get the embeddings instead of the logits? That way, I would be able to use the embeddings as features for another model. Is there a way to get the embedding projections to be of a certain dimension, like 1024?

Thank you for the help.

If you go through the codebase for the MViT model, you will see that the head layer at the end is just a linear layer. This head can be used directly to produce embeddings without any changes to the model definition: just pass the embedding dimension you want as head_num_classes:

MViT_B = create_multiscale_vision_transformers(
            spatial_size=spatial_size,
            temporal_size=temporal_size,
            head_num_classes=head_num_classes,
        )

If you wish to normalize your output embeddings, you can do so either after the forward call or by modifying the head definition above.
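As a minimal sketch of normalizing after the forward call, the snippet below L2-normalizes a batch of embeddings with torch.nn.functional.normalize. The random tensor is only a stand-in for model(tensor), and the batch size of 4 is an arbitrary assumption; L2 normalization is one common choice, not something the model requires.

```python
import torch
import torch.nn.functional as F

# Stand-in for `output = model(tensor)`: a batch of 4 embeddings of size 1024.
# (Assumption: the real model was built with head_num_classes=1024.)
embeddings = torch.randn(4, 1024)

# L2-normalize each row so that every embedding has unit length.
normalized = F.normalize(embeddings, p=2, dim=1)

# Per-row norms, useful as a quick sanity check.
norms = normalized.norm(dim=1)
```

Unit-length embeddings are convenient when the downstream model compares them with cosine similarity or dot products.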

To get 1024-dimensional embeddings, just use:

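from pytorchvideo.models.vision_transformers import create_multiscale_vision_transformers

# head_num_classes sets the output size of the final linear layer,
# so the model returns 1024-dimensional vectors instead of 400 logits.
MViT_B = create_multiscale_vision_transformers(
    spatial_size=spatial_size,
    temporal_size=temporal_size,
    head_num_classes=1024,
)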

Then train the model; it should work well.
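To illustrate how the embeddings could feed another model and be trained together with it, here is a rough sketch in plain PyTorch. TinyEncoder is a hypothetical stand-in, not the real MViT; the input shape, the 5-class downstream layer, and the SGD settings are all assumptions chosen to keep the example small. Only the idea carries over: the head's output size is the embedding size the downstream model receives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the MViT encoder (not the real model)."""

    def __init__(self, in_features, head_num_classes):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(in_features, 64), nn.GELU()
        )
        # As in the answer above, the head is a linear layer whose output
        # size is the embedding dimension.
        self.head = nn.Linear(64, head_num_classes)

    def forward(self, x):
        return self.head(self.backbone(x))


encoder = TinyEncoder(in_features=3 * 4 * 8 * 8, head_num_classes=1024)
downstream = nn.Linear(1024, 5)  # placeholder for "another model"

# Toy batch shaped (batch, channels, frames, height, width).
video = torch.randn(2, 3, 4, 8, 8)
labels = torch.tensor([0, 3])

# One optimizer over both modules, so the encoder is trained as well.
params = list(encoder.parameters()) + list(downstream.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

embeddings = encoder(video)           # shape: (2, 1024)
predictions = downstream(embeddings)  # shape: (2, 5)

loss = F.cross_entropy(predictions, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

If the encoder should stay fixed instead, one option is to pass only downstream.parameters() to the optimizer.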