Using nn.Sequential with models.video.mvit_v2_s for feature extraction

Hi,
I’m trying to replace mc3_18 with mvit_v2_s from torchvision.models.video, but I’m getting a tensor shape error. Here is my code:

import torch
import torch.nn as nn
from torchvision import models

# base = models.video.mc3_18(weights='DEFAULT', progress=True)
base = models.video.mvit_v2_s(weights='DEFAULT', progress=True)
# Drop the last child module and re-wrap the remaining ones in nn.Sequential
base = nn.Sequential(*list(base.children())[:-1])

batch_size = 4  # Adjust as needed
# Video input: (batch, channels, frames, height, width)
dummy_input = torch.randn(batch_size, 3, 16, 224, 224)
output = base(dummy_input)
print("Model input shape:", dummy_input.shape)
print("Model output shape:", output.shape)

Without wrapping the layers in nn.Sequential, mvit_v2_s works fine. I expected it to behave like mc3_18, which works with nn.Sequential and yields an output of shape [4, 512, 1, 1, 1]. Wrapping mvit_v2_s this way raises the following error, which I’m unable to debug:

def forward(self, x: torch.Tensor) -> torch.Tensor:
    411         class_token = self.class_token.expand(x.size(0), -1).unsqueeze(1)
--> 412         x = torch.cat((class_token, x), dim=1)
    413 
    414         if self.spatial_pos is not None and self.temporal_pos is not None and self.class_pos is not None:

RuntimeError: Tensors must have same number of dimensions: got 3 and 5

Is there any other way to use mvit_v2_s for feature extraction?

Putting all layers into an nn.Sequential container might be tricky, as the original forward method is not trivial, as seen here. You would need to make sure the functional API calls are applied as well, e.g. the x.flatten(2) and .transpose() calls as well as the class-token slicing. nn.Sequential only chains the child modules, so this functional "glue" is dropped and the raw 5-dimensional video tensor is passed to the positional-encoding module directly, which is where the torch.cat in your traceback fails.
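For illustration, here is a simplified sketch of what the MViT forward roughly does between the child modules (based on the torchvision source; attribute names and details may differ between versions):

# Simplified sketch of torchvision's MViT.forward, not the exact source
x = self.conv_proj(x)             # 3D patchify: (B, C, T, H, W) -> (B, embed_dim, T', H', W')
x = x.flatten(2).transpose(1, 2)  # functional glue: -> (B, num_patches, embed_dim)
x = self.pos_encoding(x)          # prepends the class token, adds positional encodings
thw = (self.pos_encoding.temporal_size,) + self.pos_encoding.spatial_size
for block in self.blocks:         # transformer blocks, tracking the (T, H, W) shape
    x, thw = block(x, thw)
x = self.norm(x)
x = x[:, 0]                       # slicing: keep only the class token
x = self.head(x)                  # classification head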
If you want to extract features, you might want to use forward hooks instead, e.g. as in the sketch below.
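For example, a minimal sketch using a forward hook on the classification head (base.head in the torchvision implementation); its input should be the normalized class-token embedding, i.e. a (batch, embed_dim) tensor, 768-dimensional for mvit_v2_s:

import torch
from torchvision import models

base = models.video.mvit_v2_s(weights='DEFAULT', progress=True)
base.eval()

features = {}

def hook(module, inputs, output):
    # inputs[0] is the tensor fed into the head, i.e. the
    # pre-classification feature of shape (batch, embed_dim)
    features['pre_head'] = inputs[0].detach()

handle = base.head.register_forward_hook(hook)

dummy_input = torch.randn(4, 3, 16, 224, 224)
with torch.no_grad():
    base(dummy_input)

print(features['pre_head'].shape)  # expected: torch.Size([4, 768])
handle.remove()  # remove the hook when you are done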


OK, thanks, I’ll use forward hooks!