Example of a torchvision video classification model

Is there any example of how to use the video classification models from torchvision?

PyTorch version: 1.7.1
OS: Windows 10, 64-bit

I am trying to forward data through a video classification model with the following script:

import torch
import torchvision

model = torchvision.models.video.r3d_18(pretrained=True, progress=True)
model.eval()

# 16 frames, 3 channels, 112x112 pixels -- no batch dimension yet
img = torch.zeros((16, 3, 112, 112))
results = model(img)

I got this error message:

RuntimeError: Expected 5-dimensional input for 5-dimensional weight [64, 3, 3, 7, 7], but got 4-dimensional input of size [3, 16, 112, 112] instead

After I change the input to [64, 3, 3, 7, 7], I can export the model, but the training code and the docs both use [3, 16, 112, 112]. This is weird. Why does this happen, and how can I use this model properly?

Thanks

When the docs say [3, 16, 112, 112], they are not including the batch size in those dimensions. You still need a batch dimension on the input, so it would look more like [N, 3, 16, 112, 112], e.g. [1, 3, 16, 112, 112] for a single clip. The [64, 3, 3, 7, 7] shape in the error message is the weight shape of the model's first conv layer, not an input shape, so you can pretty much ignore it. All you need to focus on is the number of dimensions: take the clip dimensions from the docs and prepend a batch dimension.
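As a minimal sketch of that fix (the batch size of 1 here is just an example):

import torch
import torchvision

model = torchvision.models.video.r3d_18(pretrained=True, progress=True)
model.eval()

# [batch, channels, frames, height, width]
clip = torch.zeros((1, 3, 16, 112, 112))
with torch.no_grad():
    out = model(clip)
print(out.shape)  # torch.Size([1, 400]), one logit per Kinetics-400 class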


Does anybody know what dimensions the r3d_18 model needs?
The required number of dimensions is 5; I guess the first dim is the batch.
Then what are the last 4? I tried [batch x frames x channels x W x H] but it did not work… it looks like dims 2-3 and dims 4-5 are somehow grouped together…?

PyTorch video models usually require the shape [batch_size, channels, num_frames, height, width]. We can verify this with PyTorchVideo. PyTorch Hub provides many pre-trained models along with instructions on how to use them. In this example, the pre-trained slow_r50 model expects exactly that shape.

import torch
from pytorchvideo.data.encoded_video import EncodedVideo
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    ShortSideScale,
)
from torchvision.transforms import Compose
from torchvision.transforms._transforms_video import (
    CenterCropVideo,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a 3-second clip; get_clip returns a dict whose "video" entry
# has shape [channels, frames, height, width] (no batch dimension).
start_sec = 0
end_sec = start_sec + 3

video = EncodedVideo.from_path("/coin_subset/0/-Vjo-WUdJyU.mp4")
video_data = video.get_clip(start_sec=start_sec, end_sec=end_sec)
inputs = video_data["video"].to(device)
inputs.shape
# torch.Size([3, 90, 720, 1280])

# Rescale the short side, then center-crop; the transform is applied
# to the "video" entry of the clip dict.
transform = ApplyTransformToKey(
    key="video",
    transform=Compose([
        ShortSideScale(size=256),
        CenterCropVideo(crop_size=(200, 200)),
    ]),
)
video_data = transform(video_data)

# Keep the first 10 frames and add the batch dimension, giving
# [1, 3, 10, 200, 200] = [batch, channels, frames, height, width].
inputs = video_data["video"].to(device)[:, :10, :, :].unsqueeze(0)

model = torch.hub.load('facebookresearch/pytorchvideo', 'slow_r50', pretrained=True).to(device)
model.eval()
out = model(inputs)
out.shape
# torch.Size([1, 400])
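If you want to turn those logits into predictions, here is a minimal follow-up sketch. The indices point into the Kinetics-400 class list, which the model was trained on; mapping them to human-readable names requires a label file you supply yourself.

probs = torch.nn.functional.softmax(out, dim=1)
top5 = torch.topk(probs, k=5)
print(top5.indices)  # indices into the Kinetics-400 class list
print(top5.values)   # corresponding probabilities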

You can also review this video classification tutorial to see a working example.