Here you can see what can happen when you use nn.Sequential.
If you inspect here how X3D is created, you will see that it is built entirely from nn.Module instances; my other answer explains why that makes this approach work. In contrast, MViT uses the functional API in its forward pass. You can inspect the source code here.
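To see concretely why rebuilding a model with nn.Sequential(*model.children()) can silently change behavior when the forward pass uses functional calls, here is a toy sketch (ToyModel is a hypothetical stand-in, not the real MViT):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in (not the real MViT): the activation lives in forward(),
# not in any child module, so nn.Sequential(*model.children()) drops it.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 4)

    def forward(self, x):
        x = F.relu(self.fc1(x))  # functional call, invisible to .children()
        return self.fc2(x)

torch.manual_seed(0)
model = ToyModel()
seq = nn.Sequential(*model.children())  # contains only fc1 and fc2

x = torch.randn(2, 8)
print(model(x).shape)                    # torch.Size([2, 4])
print(torch.allclose(model(x), seq(x)))  # False: the ReLU was lost
```

With MViT the forward pass additionally reshapes and routes tensors, so the Sequential rebuild does not just give different numbers, it fails outright.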
If you use the original_model then, as you said, you have to feed it a video of shape BxCxTxHxW, where C=3, H=224, and W=224 (and T=32 for this 32x3 variant).
Not changing anything
This will work.
import torch

model_name = 'mvit_base_32x3'
original_model = torch.hub.load('facebookresearch/pytorchvideo', model_name, pretrained=True)
input_vid = torch.randn(1, 3, 32, 224, 224)
original_model(input_vid).shape
# Output
torch.Size([1, 400])
Using nn.Sequential
Above I linked another answer explaining why this may not work with nn.Sequential.
features = torch.nn.Sequential(*list(original_model.children())[:-1])
features(input_vid)
As you said, this will raise the following error:
--> 201 raise NotImplementedError
Modifying the head
If you print the original model, you can see that you can access the head like this:
original_model.head
# Output
VisionTransformerBasicHead(
(sequence_pool): SequencePool()
(dropout): Dropout(p=0.5, inplace=False)
(proj): Linear(in_features=768, out_features=400, bias=True)
)
Here you can see that original_model.head.proj is a Linear layer. You can replace it to fit your needs. For example:
import torch

model_name = 'mvit_base_32x3'
original_model = torch.hub.load('facebookresearch/pytorchvideo', model_name, pretrained=True)

N_CLASSES = 10
original_model.head.proj = torch.nn.Linear(768, N_CLASSES)

input_vid = torch.randn(1, 3, 32, 224, 224)
original_model(input_vid).shape
This will work.
# Output
torch.Size([1, 10])
If, for example, you wanted to get rid of this Linear layer, you could use nn.Identity() instead. This will give you the 768-dimensional features from just before the last Linear layer.
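As a minimal sketch (using a stand-in nn.Sequential head with the same Dropout/Linear layers as the VisionTransformerBasicHead printed above, rather than downloading the real hub model), the swap looks like this; on the real model you would assign original_model.head.proj = torch.nn.Identity() instead:

```python
import torch

# Stand-in for original_model.head (same layer shapes as in the printout
# above); on the real model, assign original_model.head.proj instead.
head = torch.nn.Sequential(
    torch.nn.Dropout(p=0.5),
    torch.nn.Linear(768, 400),
)
head[1] = torch.nn.Identity()  # drop the classifier, keep the 768-dim features

feats = head(torch.randn(1, 768))
print(feats.shape)  # torch.Size([1, 768])
```

The rest of the network is untouched, so a forward pass now returns the pooled feature vector instead of the 400 class logits.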