Shape '[2, 96, 8, 56, 56]' is invalid for input of size 602112

import torch
from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights


class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.weight = MViT_V2_S_Weights.DEFAULT
        self.model = mvit_v2_s(weights=self.weight)
        self.dense = torch.nn.Linear(in_features=400, out_features=1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x, labels=None):
        x = self.model(x)
        x = self.dense(x)
        x = self.sigmoid(x)
        if labels is not None:
            labels = torch.reshape(labels, (x.shape[0], 1)) * 1.0
            loss = torch.nn.BCEWithLogitsLoss()(x, labels)
            return x, loss
        return x

After this I run the following piece of code to do some random testing and I get this error:
model = Model().to("cpu")
model.eval()

model(torch.ones(2, 3, 32, 224, 224).to("cpu"))

model.eval()

for i in range(1, 100):
    try:
        y = model(torch.ones(2, 3, i, 224, 224).to("cpu"))
        print("select", i)
    except Exception as error:
        print(error)
        pass

ERROR - shape '[2, 96, 8, 56, 56]' is invalid for input of size 602112

It seems the model is expecting a temporal size of 16:

model(torch.ones(2, 3, 16, 224, 224))

and fails with other sizes.
@pmeier do you know if this is expected?

Hey @whovivkrajput. MaxVit is an image model, but you are passing a 5D input. This is not supported. From the documentation:

Accepts PIL.Image, batched (B, C, H, W) and single (C, H, W) image torch.Tensor objects.

So if you remove the i from

y = model(torch.ones(2, 3, 224, 224).to("cpu"))

it should work.

mvit_v2_s seems to be a torchvision.models.video model described here, which expects inputs in the following format:

Accepts batched (B, T, C, H, W) and single (T, C, H, W) video frame torch.Tensor objects. The frames are resized to resize_size=[256] using interpolation=InterpolationMode.BILINEAR, followed by a central crop of crop_size=[224, 224]. Finally the values are first rescaled to [0.0, 1.0] and then normalized using mean=[0.45, 0.45, 0.45] and std=[0.225, 0.225, 0.225]. Finally the output dimensions are permuted to (..., C, T, H, W) tensors.
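
For illustration, here is a minimal sketch of what those transforms do to a dummy clip; the expected shapes are assumptions based on the documentation quoted above:

import torch
from torchvision.models.video import MViT_V2_S_Weights

weights = MViT_V2_S_Weights.DEFAULT
transform = weights.transforms()

# Dummy (B, T, C, H, W) clip with a spatial size larger than the 224x224 crop.
clip = torch.rand(2, 16, 3, 256, 256)

batch = transform(clip)
# After resize, crop, rescale, normalize and permute this should be
# (B, C, T, H, W), i.e. torch.Size([2, 3, 16, 224, 224]).
print(batch.shape)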

The interesting part is:

min_size height=224, width=224
min_temporal_size 16

which seems to be hard-coded, as the code fails if the temporal size is changed.
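
Those values can also be read programmatically; a minimal sketch, assuming the min_size and min_temporal_size keys are present in the metadata of the pretrained weights:

from torchvision.models.video import MViT_V2_S_Weights

weights = MViT_V2_S_Weights.DEFAULT
# Keys assumed to exist in the weights metadata; .get() avoids a KeyError if not.
print(weights.meta.get("min_size"))           # expected: (224, 224)
print(weights.meta.get("min_temporal_size"))  # expected: 16
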
Am I looking at the wrong model?

No, you are right. I skimmed over that since the code is not properly formatted. @whovivkrajput, could you please wrap your code in triple backticks next time?

I’m no expert here, but clip_len=16 is set during training. So if you want to use our pretrained weights, you need to match that.
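
If your raw clips have more than 16 frames, one option is to subsample them down to 16 before calling the model. A minimal sketch, where sample_clip is just a hypothetical helper and the clip is assumed to already be laid out as (B, C, T, H, W):

import torch

def sample_clip(video, clip_len=16):
    # Pick clip_len evenly spaced frame indices along the temporal axis,
    # which is dimension -3 for a (..., C, T, H, W) layout.
    t = video.shape[-3]
    idx = torch.linspace(0, t - 1, clip_len).long()
    return video.index_select(video.dim() - 3, idx)

clip = torch.ones(2, 3, 40, 224, 224)   # 40 frames, more than the pretrained clip_len
print(sample_clip(clip).shape)          # torch.Size([2, 3, 16, 224, 224])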

Thank you for your help, @pmeier @ptrblck. I do have a few questions.

This model can be used for video recognition, and looking at the input (2, 3, i, 224, 224), i here is the frame length. According to your answer, 16 is the clip_len. So can I use another value for clip_len other than 16?

So can I use another value for clip_len other than 16?

Nope. This is hardcoded here. You can open an issue if you need other clip lengths and thus want to be able to set this value in the builder of the model. Note, however, that this will not change the fact that our weights are trained for 16 frames, and that won't change. Meaning, you will have to train the model yourself.

For completeness, here is a full example of how to use the model:

from torchvision import models
import torch

name = "MViT_V2_S"
builder = models.get_model_builder(name)
weights = models.get_model_weights(name).DEFAULT

model = builder(weights=weights)
transform = weights.transforms()

# (B, T, C, H, W) input; the transform permutes it to (B, C, T, H, W).
input = torch.ones(2, 16, 3, 224, 224)
result = model(transform(input))
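
Continuing that example, the output should be a batch of Kinetics-400 logits. A minimal check, assuming the default weights are the Kinetics-400 ones and expose the category names in their metadata (the predicted label is meaningless for an all-ones input, but it shows the plumbing):

print(result.shape)  # expected: torch.Size([2, 400])

# Map the highest-scoring logit of each clip to a human-readable label.
top = result.softmax(dim=1).argmax(dim=1)
print([weights.meta["categories"][int(i)] for i in top])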

@whovivkrajput It seems you are not alone here. Someone raised the issue to us: The temporal size of MVIT_V2_S can not be greater than 16 · Issue #7345 · pytorch/vision · GitHub.