Example of torchvision video classification model

Is there any example of how to use the video classification models in torchvision?

pytorch version : 1.7.1
os : win10 64

I am trying to forward data through a video classification model with the following script:

import numpy as np
import torch
import torchvision

model = torchvision.models.video.r3d_18(pretrained=True, progress=True)

img = torch.zeros((16, 3, 112, 112))
results = model(img)  

I got this error message:

RuntimeError: Expected 5-dimensional input for 5-dimensional weight [64, 3, 3, 7, 7], but got 4-dimensional input of size [3, 16, 112, 112] instead

After I changed the input to [64, 3, 3, 7, 7], I can export the model, but the training code and the docs both use [3,16,112,112]. This is weird; why does this happen? How can I use this model properly?


When the docs say [3,16,112,112], they are not including the batch size in those dimensions. You still need a batch dimension on the input, so it would look more like [64,3,16,112,112] for a batch of 64 clips, i.e. (batch, channels, frames, height, width). The shape [64, 3, 3, 7, 7] in the error message is the weight of the first convolution layer (64 output channels, 3 input channels, a 3x7x7 kernel), not a required input shape, so you can pretty much ignore it. All you need to focus on is the number of dimensions in this case, and just use the clip dimensions from the docs.


Does anybody know what dimensions the r3d_18 model needs?
The required number of dimensions is 5, and I guess the first dim is the batch.
Then, what are the last 4? I tried [Batch X frame X filters X W X H] but it did not work… it looks like dims 2,3 and 4,5 have the same groups…?
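For what it's worth, the torchvision video models are built on 3D convolutions, which take input as (batch, channels, frames, height, width), so the channel dimension comes before the frame dimension. A small sketch of reordering a frame-first batch, assuming your data is laid out as [Batch X frame X channels X H X W] (the tensor names here are made up for illustration):

```python
import torch

# Assumption: clips are stored frame-first, (batch, frames, channels, H, W).
frames_first = torch.zeros((2, 16, 3, 112, 112))

# Swap dims 1 and 2 to get (batch, channels, frames, H, W).
channels_first = frames_first.permute(0, 2, 1, 3, 4).contiguous()

print(channels_first.shape)  # torch.Size([2, 3, 16, 112, 112])
```

The reordered tensor has the layout the model expects as input.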