Import pytorchvideo transformer model

szahan · April 4, 2022, 5:30am

Hi

I am trying to import the last MViT model from the model zoo with pretrained weights.

link: Model Zoo and Benchmarks — PyTorchVideo documentation

There are many examples for slow_r50/slowfast_r50, but I could not find any for MViT.

For example, the x3d_s model can be loaded using the following code:

import torch

model_name = 'x3d_s'
model = torch.hub.load('facebookresearch/pytorchvideo', model_name, pretrained=True)

I found this example here: X3D | PyTorch

But how can I load MViT? I tried combining arch and depth, as is the case with many models (though not all), but it did not work.

Also, what will the input shape be?

Could you please help? Thanks

Matias_Vasquez · April 4, 2022, 7:45am

You can use the GitHub repository as a guide: see how it is done for x3d_s, then look at how it should be done for the transformer.

This is the code for x3d_s. As you can see, you can load any model from this file by using the function names from the def statements.

x3d_s: line 68
x3d_m: line 100
x3d_l: line 132
etc.

github.com

facebookresearch/pytorchvideo/blob/5e585415bde879756f60b8864b78e87c102c6abc/pytorchvideo/models/hub/x3d.py#L68

      
        
    return _x3d(
        pretrained=pretrained,
        progress=progress,
        checkpoint_path=checkpoint_paths["x3d_xs"],
        input_clip_length=4,
        input_crop_size=160,
        **kwargs,
    )


def x3d_s(
    pretrained: bool = False,
    progress: bool = True,
    **kwargs,
):
    """
    X3D-XS model architecture [1] trained on the Kinetics dataset.
    Model with pretrained weights has top1 accuracy of 73.33.

    [1] Christoph Feichtenhofer, "X3D: Expanding Architectures for
    Efficient Video Recognition." https://arxiv.org/abs/2004.04730
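As an alternative to reading the function names off GitHub, the entrypoint names can probably also be listed from Python with torch.hub.list. The sketch below is only an illustration: filter_entrypoints and list_hub_models are hypothetical helper names, not part of pytorchvideo, and the hub call needs torch installed plus network access.

```python
def filter_entrypoints(names, prefix):
    """Return the entrypoint names that start with `prefix`, sorted."""
    return sorted(name for name in names if name.startswith(prefix))


def list_hub_models(prefix="mvit"):
    """List hub entrypoints of the pytorchvideo repo that match `prefix`.

    Hypothetical helper: requires torch and network access, because
    torch.hub.list fetches the repository's hubconf.py.
    """
    import torch  # imported lazily so the pure helper above works without torch

    names = torch.hub.list("facebookresearch/pytorchvideo")
    return filter_entrypoints(names, prefix)
```

Calling list_hub_models("x3d") or list_hub_models("mvit") should then return the same names as the def statements in the corresponding hub file.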

If you now go to the transformer file, you can look for the definitions

mvit_base_16x4: line 57
mvit_base_32x3: line 92
mvit_base_16: line 127
etc

github.com

facebookresearch/pytorchvideo/blob/main/pytorchvideo/models/hub/vision_transformers.py

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.

from typing import Any

import torch.nn as nn
from pytorchvideo.models.vision_transformers import (
    create_multiscale_vision_transformers,
)

from .utils import MODEL_ZOO_ROOT_DIR, hub_model_builder


checkpoint_paths = {
    "mvit_base_16x4": "{}/kinetics/MVIT_B_16x4.pyth".format(MODEL_ZOO_ROOT_DIR),
    "mvit_base_32x3": "{}/kinetics/MVIT_B_32x3_f294077834.pyth".format(
        MODEL_ZOO_ROOT_DIR
    ),
    "mvit_base_16": "{}/imagenet/MVIT_B_16_f292487636.pyth".format(MODEL_ZOO_ROOT_DIR),
}


So now you only need to choose the one you want and do the same thing in your example code.

# For example
import torch

model_name = "mvit_base_32x3"
model = torch.hub.load('facebookresearch/pytorchvideo', model_name, pretrained=True)
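On the input shape: the thread does not state it, so treat the following as a sketch under assumptions. It assumes the "32x3" suffix means 32 frames at a sampling rate of 3, that the spatial crop is 224, and that the model takes a (batch, channels, frames, height, width) tensor. expected_clip_shape and smoke_test are hypothetical helpers; check the values against the hub file before relying on them.

```python
import re


def expected_clip_shape(model_name, batch_size=1, crop_size=224):
    """Guess a (B, C, T, H, W) input shape from a name like 'mvit_base_32x3'.

    Assumptions (not confirmed in this thread): in the 'TxS' suffix, T is the
    number of frames and S the sampling rate; crop_size=224 is the spatial crop.
    """
    match = re.search(r"_(\d+)x(\d+)$", model_name)
    if match is None:
        raise ValueError(f"no 'TxS' suffix in {model_name!r}")
    num_frames = int(match.group(1))
    return (batch_size, 3, num_frames, crop_size, crop_size)


def smoke_test(model_name="mvit_base_32x3"):
    """Load the hub model and run one random clip through it.

    Requires torch and network access; a shape error here would mean the
    assumed clip shape is wrong for this model.
    """
    import torch

    model = torch.hub.load("facebookresearch/pytorchvideo", model_name, pretrained=True)
    model.eval()
    clip = torch.randn(*expected_clip_shape(model_name))
    with torch.no_grad():
        return model(clip).shape
```

If smoke_test() runs without a shape error, the assumed input shape is at least accepted by the model; real clips would still need the preprocessing described in the model zoo documentation.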

szahan · April 7, 2022, 4:42am

Thank you so much for your detailed reply. It will not only solve this problem but will also help me in the future.