I am trying to use MViT as an encoder to get embeddings of an input, which I then pass into another model. I am looking at pytorchvideo/vision_transformers.py at main · facebookresearch/pytorchvideo · GitHub, specifically the create_multiscale_vision_transformers function, to build the MViT model. Something like this:
from pytorchvideo.models.vision_transformers import create_multiscale_vision_transformers

model = create_multiscale_vision_transformers(spatial_size=100, temporal_size=10)
Then, when I pass a tensor to the model, like

output = model(tensor)

and check the output's shape,
I get a tensor of dimension 400. This looks like the logits for the classification task rather than an embedding.
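For context, this is the dummy input I am using; the (batch, channels, time, height, width) layout is my assumption based on the MViT docstring:

import torch

# Dummy clip matching spatial_size=100 and temporal_size=10 above
tensor = torch.randn(1, 3, 10, 100, 100)  # (batch, channels, time, height, width)
output = model(tensor)
print(output.shape)  # torch.Size([1, 400]); the 400 seems to come from the head_num_classes=400 default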
The question is: how do I get the embeddings instead of the logits? That way, I would be able to use the embeddings as features for another model. Also, is there a way to project the embeddings to a specific dimension, like 1024?
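From skimming vision_transformers.py, it looks like create_multiscale_vision_transformers accepts a head argument (defaulting to a classification head), so my current guess is the sketch below. The head=None behavior, the returned token layout, and taking token 0 as the cls token are all assumptions on my part:

import torch
from torch import nn
from pytorchvideo.models.vision_transformers import create_multiscale_vision_transformers

# Assumption: head=None skips the classification head, so forward()
# returns the transformer tokens rather than 400-way logits.
encoder = create_multiscale_vision_transformers(
    spatial_size=100,
    temporal_size=10,
    head=None,
)

clip = torch.randn(1, 3, 10, 100, 100)  # (batch, channels, time, height, width)
with torch.no_grad():
    tokens = encoder(clip)  # presumably (batch, num_tokens, embed_dim)

embedding = tokens[:, 0]  # assumption: token 0 is the cls token

# For a fixed 1024-d feature, I imagine adding my own projection on top:
project = nn.Linear(embedding.shape[-1], 1024)
features = project(embedding)  # (batch, 1024)

Is that the intended way to do it, or is there a built-in option I am missing?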
Thank you for the help.