Here you can see what can happen when you use nn.Sequential. If you inspect here how x3d is created, you will see that it is built from nn.Modules. If you look at my other answer, you will see why this works.
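A toy sketch (not the actual x3d code) of why that matters: when forward() only chains registered sub-modules, rebuilding the model with nn.Sequential(*children) reproduces it.
import torch
import torch.nn as nn

# Toy model whose forward() only chains registered sub-modules
class ModuleStyle(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(8, 4)

    def forward(self, x):
        return self.linear(self.flatten(x))

m = ModuleStyle()
x = torch.randn(2, 2, 4)
rebuilt = nn.Sequential(*m.children())  # Flatten -> Linear, same order
torch.allclose(m(x), rebuilt(x))
# Output
True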
In contrast, MViT uses the functional API in its forward pass. You can inspect the source code here.
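A similar toy sketch (again, not the actual MViT code): logic that lives only inside forward() is lost when you rebuild the model from its children.
import torch
import torch.nn as nn

# Toy model whose forward() uses a functional call
class FunctionalStyle(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 4)

    def forward(self, x):
        x = x.flatten(1)  # functional call, not a registered module
        return self.linear(x)

m = FunctionalStyle()
x = torch.randn(2, 2, 4)
m(x).shape
# Output
torch.Size([2, 4])

rebuilt = nn.Sequential(*m.children())  # only the Linear layer survives
# rebuilt(x) now fails, because the flatten step from forward() is gone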
If you use the original_model then, as you said, you have to input the video in the shape BxCxTxHxW, where C=3, H=224 and W=224.
Not changing anything
This will work.
import torch

model_name = 'mvit_base_32x3'
original_model = torch.hub.load('facebookresearch/pytorchvideo', model_name, pretrained=True)

input_vid = torch.randn(1, 3, 32, 224, 224)
original_model(input_vid).shape
# Output
torch.Size([1, 400])
Using nn.Sequential
Above I have linked another answer explaining why this may not work with nn.Sequential.
# Rebuild the model without its last child module
features = torch.nn.Sequential(*list(original_model.children())[:-1])
features(input_vid)
As you said, this will raise the following error:
--> 201 raise NotImplementedError
Modifying the head
If you print the original model, you can see that you can access the head like this:
original_model.head
# Output
VisionTransformerBasicHead(
  (sequence_pool): SequencePool()
  (dropout): Dropout(p=0.5, inplace=False)
  (proj): Linear(in_features=768, out_features=400, bias=True)
)
Here you can see that original_model.head.proj is a Linear layer. You can modify this to fit your needs. For example:
import torch

model_name = 'mvit_base_32x3'
original_model = torch.hub.load('facebookresearch/pytorchvideo', model_name, pretrained=True)

# Swap the final projection for a new Linear layer with 10 output classes
N_CLASSES = 10
original_model.head.proj = torch.nn.Linear(768, N_CLASSES)

input_vid = torch.randn(1, 3, 32, 224, 224)
original_model(input_vid).shape
This will work.
# Output
torch.Size([1, 10])
If, for example, you wanted to get rid of this Linear layer, you could replace it with nn.Identity() instead. This will give you the output from just before the last Linear layer.
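A minimal sketch of that (reusing the model from above; since the removed Linear layer has in_features=768, the result should be a 768-dimensional feature vector per clip):
original_model.head.proj = torch.nn.Identity()

input_vid = torch.randn(1, 3, 32, 224, 224)
original_model(input_vid).shape
# Expected output
torch.Size([1, 768])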