Help with mc3_18 model

Hi folks, I’m new to ML and PyTorch, so apologies in advance for some very beginner questions.

I’m trying to extract features from a video using this model, but I’m a bit confused about how to use it.
Reading the docs, it seems the model accepts input as (B, T, C, H, W), so this is what I’ve done to capture frames with OpenCV and convert them into a 5D tensor:

def preprocess_video_frames(video_path, frame_size=(224, 224)):
    frames = []
    capture = cv2.VideoCapture(video_path)
    while True:
        ret, frame = capture.read()
        if not ret:
            break

        # Resize each frame to the desired frame_size
        frame = cv2.resize(frame, frame_size)
        # Normalize pixel values to the range [0, 1]
        frame = frame / 255.0
        #frame = np.transpose(frame, (2, 0, 1))
        # Append the preprocessed frame to the list
        frames.append(frame)

    capture.release()
    return frames

def load_and_preprocess_video(video_path):

    frame_list = preprocess_video_frames(video_path)
    video_frames = np.stack(frame_list) # Convert list into a 4D array
    video_tensor = torch.tensor(video_frames, dtype=torch.float32).permute(0, 3, 1, 2)

    # Extract features using the pre-trained I3D model
    with torch.no_grad():
        features = i3d_model(video_tensor)
    return features

I get the following error:

Given groups=1, weight of size [64, 3, 3, 7, 7], expected input[1, 303, 3, 224, 224] to have 3 channels, but got 303 channels instead

This is where I’m a bit confused: I thought this tensor would have the shape (B, T, C, H, W), and hence the 3-channel dimension would be correct.

Could someone shed some light on what I’m doing wrong?

Thank you

nn.Conv3d expects inputs as [N, C, D, H, W] as described in the docs.
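A quick shape check with random tensors and a standalone nn.Conv3d layer (using the same kernel size as in your error message, for illustration) shows the expected layout:

```python
import torch
import torch.nn as nn

# nn.Conv3d expects [N, C, D, H, W]:
# N = batch, C = channels, D = depth (frames), H/W = spatial dims.
conv = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 7, 7))

# 1 clip, 3 channels, 16 frames, 112x112 frames
x = torch.randn(1, 3, 16, 112, 112)
out = conv(x)
print(out.shape)  # torch.Size([1, 64, 14, 106, 106])
```

Note that the channel dimension comes right after the batch dimension, before the temporal one, which is why an input laid out as (T, C, H, W) makes the layer treat T as channels.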

Thanks I was looking at this page: mc3_18 — Torchvision main documentation
I’m afraid I don’t understand well enough how to do this properly; my example seems to be way off. I’ll do some more digging on how to use the model for feature extraction.


From the link:

The inference transforms are available at MC3_18_Weights.KINETICS400_V1.transforms and perform the following preprocessing operations: Accepts batched (B, T, C, H, W) and single (T, C, H, W) video frame torch.Tensor objects.

In that case I might be wrong, and the model's transforms seem to expect a different format than native nn.Conv3d layers do. Could you post a minimal and executable code snippet reproducing the issue using random tensors?