Is there an example of how to use the video classification models in torchvision?
pytorch version : 1.7.1
os : win10 64
I am trying to run a forward pass through a video classification model with the following script:
import numpy as np
import torch
import torchvision

model = torchvision.models.video.r3d_18(pretrained=True, progress=True)
img = torch.zeros((3, 16, 112, 112))
results = model(img)
I got this error message:
RuntimeError: Expected 5-dimensional input for 5-dimensional weight [64, 3, 3, 7, 7], but got 4-dimensional input of size [3, 16, 112, 112] instead
After I change the input to [64, 3, 3, 7, 7], I can export the model, but the training code and the docs both use [3, 16, 112, 112]. This is weird. Why does this happen, and how can I use this model properly?
When the docs say [3, 16, 112, 112], they are not including the batch size in those dimensions. You still need a batch dimension in front, so the input looks like [batch_size, 3, 16, 112, 112], for example [64, 3, 16, 112, 112] for a batch of 64 clips. The [64, 3, 3, 7, 7] in the error message is the shape of the first convolution's weight ([out_channels, in_channels, kernel_t, kernel_h, kernel_w]), not the shape your input should have, so you can ignore it. All you need to focus on here is the number of dimensions (5 instead of 4); for the sizes, use the clip dimensions from the docs.
Does anybody know what input dimensions the r3d_18 model needs?
The input needs 5 dimensions. I guess the first dimension is the batch.
Then what are the last 4? I tried [batch x frames x filters x W x H], but it did not work… It looks like dims 2, 3 and dims 4, 5 belong to the same groups…?
PyTorch video models usually expect input of shape [batch_size, channel, number_of_frame, height, width]. We can verify this with PyTorchVideo. PyTorch Hub provides many pre-trained models together with usage instructions. In the example below, the pre-trained model expects that same shape.
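import torch
from pytorchvideo.data.encoded_video import EncodedVideo

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a 3-second clip from the video.
start_sec = 0
end_sec = start_sec + 3
video = EncodedVideo.from_path("/coin_subset/0/-Vjo-WUdJyU.mp4")
video_data = video.get_clip(start_sec=start_sec, end_sec=end_sec)
inputs = video_data["video"].to(device)
# inputs.shape: torch.Size([3, 90, 720, 1280]) -> [channel, number_of_frame, height, width]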
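# The import lists and the transform arguments were cut off in the original post.
from pytorchvideo.transforms import ApplyTransformToKey  # other imported names truncated
from torchvision.transforms import Compose
# from torchvision.transforms._transforms_video import (...)  # imported names truncated
# transform = ApplyTransformToKey(...)  # arguments truncated
video_data = transform(video_data)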
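# Keep the first 10 frames, then add the batch dimension the model expects.
inputs = video_data["video"].to(device)[:, :10, :, :]
model = torch.hub.load('facebookresearch/pytorchvideo', 'slow_r50', pretrained=True).to(device).eval()
out = model(inputs[None, ...])
# out.shape: torch.Size([1, 400])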
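If the frames arrive in a different order, for example [batch, frames, channels, W, H] as tried earlier in this thread, or [frames, height, width, channels] as torchvision.io.read_video returns them, a permute is enough to reach the [batch_size, channel, number_of_frame, height, width] layout. A small sketch with dummy tensors; the helper names are hypothetical, not part of any library.

```python
import torch


def btcwh_to_bcthw(x: torch.Tensor) -> torch.Tensor:
    """[batch, frames, channels, width, height] -> [batch, channels, frames, height, width]."""
    return x.permute(0, 2, 1, 4, 3).contiguous()


def thwc_to_bcthw(frames: torch.Tensor) -> torch.Tensor:
    """[frames, height, width, channels] -> [1, channels, frames, height, width]."""
    return frames.permute(3, 0, 1, 2).unsqueeze(0).contiguous()


# Dummy data with distinct sizes per axis so a wrong permutation is easy to spot.
a = torch.zeros((2, 16, 3, 128, 112))  # batch, frames, channels, width, height
b = torch.zeros((16, 112, 128, 3))     # frames, height, width, channels

print(tuple(btcwh_to_bcthw(a).shape))  # (2, 3, 16, 112, 128)
print(tuple(thwc_to_bcthw(b).shape))   # (1, 3, 16, 112, 128)
```

Decoded frames are often uint8, so a cast such as .float() plus the normalization the chosen model expects would normally follow; that part depends on the model and is left out here.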
You can also review this video classification tutorial to see a working example.