RuntimeError: The size of tensor a (65537) must match the size of tensor b (50177) at non-singleton dimension 1

I am trying to extract part of the pretrained MViT model, but it gives me the following error: RuntimeError: The size of tensor a (65537) must match the size of tensor b (50177) at non-singleton dimension 1. If I use x3d_l instead, it works just fine. I am not sure what I am doing wrong.

import torch
import torch.nn as nn

class VideoModel(nn.Module):
    def __init__(self):
        super(VideoModel, self).__init__()
        # X3D_L - 77.44% and mvit_base_32x3 - 80.30% on Kinetics-400
        model_name = 'mvit_base_32x3'
        original_model = torch.hub.load('facebookresearch/pytorchvideo', model_name, pretrained=True)
        self.features = nn.Sequential(
            *list(original_model.children())[:-1]  # mvit
            # *list(original_model.blocks.children())[:-1]  # x3d
        )

    def forward(self, x):
        x = self.features(x)
        return x

model = VideoModel()
x = torch.randn(1, 3, 32, 256, 256)
pred = model(x)

If I change the size of x to (1, 3, 32, 224, 224) I get a NotImplementedError instead, but the following code works and produces the [1, 400] prediction tensor.

model_name = 'mvit_base_32x3'  # X3D_L - 77.44% and mvit_base_32x3 - 80.30% on Kinetics-400
original_model = torch.hub.load('facebookresearch/pytorchvideo', model_name, pretrained=True)
x = torch.randn(1, 3, 32, 224, 224)
pred = original_model(x)

Any suggestions would be a great help. Thanks.

Here you can see what can happen when you use nn.Sequential.

If you inspect here how X3D is created, you will see that it is composed of nn.Module blocks, each mapping a single tensor to a single tensor. My other answer explains why that is exactly the contract nn.Sequential expects, and hence why slicing X3D works, as in the sketch below.
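
A minimal sketch of that, reusing the input shape from your question (which you already confirmed works for X3D):

# X3D's blocks are plain convolutional nn.Modules, so chaining them
# sequentially reproduces the original forward logic (minus the head):
import torch

x3d = torch.hub.load('facebookresearch/pytorchvideo', 'x3d_l', pretrained=True)
features = torch.nn.Sequential(*list(x3d.blocks.children())[:-1])
out = features(torch.randn(1, 3, 32, 256, 256))  # works: each block maps tensor -> tensor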

In contrast, MViT uses the functional API in its forward pass: the top-level forward does more than call each child on the previous output, as sketched below. You can inspect the source code here.
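
Roughly paraphrased (this is a sketch of the idea, not the verbatim pytorchvideo source), MViT's forward looks like this. Note how each block takes and returns a (tensor, thw) pair, which nn.Sequential cannot thread through:

# paraphrased sketch of MultiscaleVisionTransformers.forward
def forward(self, x):
    x = self.patch_embed(x)
    x = self.cls_positional_encoding(x)  # fixed-size positional table
    thw = self.cls_positional_encoding.patch_embed_shape
    for blk in self.blocks:
        x, thw = blk(x, thw)  # two inputs, two outputs per block
    x = self.norm_embed(x)
    x = self.head(x)
    return x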

If you use the original_model then, as you said, you have to input the video in the shape BxCxTxHxW where C=3, H=224 and W=224. This is because the pretrained positional encoding is sized for 224x224 inputs: the patch embedding of mvit_base_32x3 turns a 32x224x224 clip into 16x56x56 = 50176 patch tokens plus 1 class token = 50177, while a 256x256 clip yields 16x64x64 + 1 = 65537 tokens, which is exactly the mismatch in your RuntimeError.

Not changing anything

This will work.

model_name = 'mvit_base_32x3'
original_model = torch.hub.load('facebookresearch/pytorchvideo', model_name, pretrained=True)

input_vid = torch.randn(1, 3, 32, 224, 224)
original_model(input_vid).shape
# Output
torch.Size([1, 400])

Using nn.Sequential

Above I have linked another answer as to why it may not work with nn.Sequential.

features = torch.nn.Sequential(*list(original_model.children())[:-1])
features(input_vid)

As you said, this will raise the following error

--> 201     raise NotImplementedError
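
My reading of why (an inference, not confirmed in the traceback above): list(original_model.children()) hands nn.Sequential the model's blocks attribute as a bare nn.ModuleList, and nn.ModuleList is a container only, with no forward of its own. Calling it reproduces the same error:

import torch
import torch.nn as nn

# nn.ModuleList defines no forward, so invoking it raises NotImplementedError
blocks = nn.ModuleList([nn.Linear(4, 4)])
blocks(torch.randn(1, 4))  # raises NotImplementedError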

Modifying the head

If you print the original model, you can see that the head is accessible like this:

original_model.head
# Output
VisionTransformerBasicHead(
  (sequence_pool): SequencePool()
  (dropout): Dropout(p=0.5, inplace=False)
  (proj): Linear(in_features=768, out_features=400, bias=True)
)

Here you can see that original_model.head.proj is a Linear layer. You can modify this to fit your needs.

For example:

model_name = 'mvit_base_32x3'
original_model = torch.hub.load('facebookresearch/pytorchvideo', model_name, pretrained=True)

N_CLASSES = 10
original_model.head.proj = torch.nn.Linear(768, N_CLASSES)

input_vid = torch.randn(1, 3, 32, 224, 224)
original_model(input_vid).shape

This will work.

# Output
torch.Size([1, 10])

If, for example, you wanted to get rid of this Linear layer entirely, you could replace it with nn.Identity() instead. This will give you the output up until just before the last Linear layer, as shown below.
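
A minimal sketch of that replacement (the [1, 768] shape follows from the in_features of the original proj layer shown above):

original_model.head.proj = torch.nn.Identity()

input_vid = torch.randn(1, 3, 32, 224, 224)
original_model(input_vid).shape
# Output
torch.Size([1, 768])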

Thank you for your detailed explanation. I had no idea about this behavior.
