RuntimeError: The size of tensor a (65537) must match the size of tensor b (50177) at non-singleton dimension 1

This is what can happen when you use nn.Sequential.

If you inspect here how X3D is created, you will see that it is composed of nn.Modules. My other answer explains why that structure works with nn.Sequential.

In contrast, MViT uses the functional API in its forward pass. You can inspect the source code here.
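
As a minimal illustration of that difference (the two toy classes below are hypothetical, not from pytorchvideo): when all the logic lives in the child modules, chaining the children reproduces the model, but when part of the computation is done with plain tensor ops inside forward(), nn.Sequential(*model.children()) silently skips it.

import torch
import torch.nn as nn

# Hypothetical model whose forward() just chains its children
# (roughly the X3D situation) -- nn.Sequential reproduces it.
class ChainedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 16)
        self.head = nn.Linear(16, 4)

    def forward(self, x):
        return self.head(self.backbone(x))

# Hypothetical model that pools with a plain tensor op inside forward()
# (roughly the MViT situation) -- that step is not a child module,
# so nn.Sequential(*model.children()) never runs it.
class FunctionalForwardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 16)
        self.head = nn.Linear(16, 4)

    def forward(self, x):              # x: (batch, seq, 8)
        x = self.backbone(x)            # (batch, seq, 16)
        x = x.mean(dim=1)               # functional pooling, only in forward()
        return self.head(x)             # (batch, 4)

flat_input = torch.randn(2, 8)
m1 = ChainedModel()
print(torch.allclose(m1(flat_input), nn.Sequential(*m1.children())(flat_input)))  # True

seq_input = torch.randn(2, 5, 8)
m2 = FunctionalForwardModel()
print(m2(seq_input).shape)                              # torch.Size([2, 4])
print(nn.Sequential(*m2.children())(seq_input).shape)   # torch.Size([2, 5, 4]) -- pooling was skipped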

If you use the original_model then, as you said, you have to feed it a video of shape BxCxTxHxW, where C=3, T=32, H=224 and W=224.

Not changing anything

This will work.

model_name = 'mvit_base_32x3'
# Load the pretrained MViT model from PyTorch Hub
original_model = torch.hub.load('facebookresearch/pytorchvideo', model_name, pretrained=True)

# Dummy clip: batch=1, C=3, T=32, H=224, W=224
input_vid = torch.randn(1, 3, 32, 224, 224)
original_model(input_vid).shape
# Output
torch.Size([1, 400])

Using nn.Sequential

Above I linked another answer explaining why this may not work with nn.Sequential.

# Chain all children except the last one (the head)
features = torch.nn.Sequential(*list(original_model.children())[:-1])
features(input_vid)

As you said, this will raise the following error:

--> 201     raise NotImplementedError

Modifying the head

If you print the original model, you will see that you can access its head like this:

original_model.head
# Output
VisionTransformerBasicHead(
  (sequence_pool): SequencePool()
  (dropout): Dropout(p=0.5, inplace=False)
  (proj): Linear(in_features=768, out_features=400, bias=True)
)

Here you can see that original_model.head.proj is a Linear layer. You can modify this to fit your needs.

For example:

model_name = 'mvit_base_32x3'
original_model = torch.hub.load('facebookresearch/pytorchvideo', model_name, pretrained=True)

# Replace the 768 -> 400 projection with a 768 -> 10 one
N_CLASSES = 10
original_model.head.proj = torch.nn.Linear(768, N_CLASSES)

input_vid = torch.randn(1, 3, 32, 224, 224)
original_model(input_vid).shape

This will work.

# Output
torch.Size([1, 10])

If, for example, you wanted to get rid of this Linear layer entirely, you could replace it with nn.Identity() instead.

This will give you the output from just before the last Linear layer, i.e. the pooled 768-dimensional features.
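
A quick sketch of that variant (same hub model as above; the expected output size follows from proj having in_features=768):

model_name = 'mvit_base_32x3'
feature_extractor = torch.hub.load('facebookresearch/pytorchvideo', model_name, pretrained=True)

# Pass-through instead of the 768 -> 400 classification layer
feature_extractor.head.proj = torch.nn.Identity()
feature_extractor.eval()  # so the head's Dropout is a no-op as well

input_vid = torch.randn(1, 3, 32, 224, 224)
feature_extractor(input_vid).shape
# Output
torch.Size([1, 768])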