I3D and X3D_XS inference speed comparison

I’ve been testing the I3D and X3D_XS models from PyTorchVideo to classify short video sequences.

The PyTorchVideo model zoo table and a manual inspection of the models show that X3D_XS has roughly a tenth of the parameters of I3D (about 3M vs. 30M).
Based on this, I was expecting X3D_XS to have a much higher inference speed than I3D, especially since X3D_XS accepts clips with as few as 4 frames, whereas I3D only works with clips of at least 8 frames. However, I found the opposite: I3D seems to run inference faster.
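
For reference, by “manual inspection” I mean a quick check like the following, which just sums the sizes of all parameter tensors (a minimal sketch; the exact counts may vary slightly with the model version):

import torch

# same hub entry points as in the timing script below
i3d = torch.hub.load("facebookresearch/pytorchvideo", "i3d_r50", pretrained=True)
x3d = torch.hub.load("facebookresearch/pytorchvideo", "x3d_xs", pretrained=True)

for name, model in [("i3d_r50", i3d), ("x3d_xs", x3d)]:
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")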

Here is an example of what I’ve been doing:

import numpy as np
import torch


def time_model(model, num_frames):

    times = []
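    # CUDA events measure GPU-side time; elapsed_time() reports milliseconds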
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # dummy input: batch of 4 clips, 3 channels, num_frames frames, 224x224 pixels
    input_data = torch.zeros(
        [4, 3, num_frames, 224, 224], dtype=torch.float16, device=device
    )

    for _ in range(100):
        start.record()
        _ = model(input_data)
        end.record()
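        # wait for the GPU to finish before reading the elapsed time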
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

    # mean inference time, discarding the first 10 iterations as warm-up
    return np.mean(times[10:])


def main():

    # load both pretrained models in eval mode, half precision, on the target device
    i3d_model = torch.hub.load(
        "facebookresearch/pytorchvideo", "i3d_r50", pretrained=True
    ).eval().half().to(device)
    x3d_xs_model = torch.hub.load(
        "facebookresearch/pytorchvideo", "x3d_xs", pretrained=True
    ).eval().half().to(device)

    # timing I3D with 4 sequences of 8 frames
    avg_time_i3d = time_model(i3d_model, 8)
    # timing X3D_XS with 4 sequences of 5 frames
    avg_time_x3dxs = time_model(x3d_xs_model, 5)

    print(f"I3D: {avg_time_i3d}")
    print(f"X3D_XS: {avg_time_x3dxs}")


if __name__ == "__main__":

    # module-level global, picked up by time_model() and main()
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    main()

In my tests (on an RTX 8000 GPU), X3D_XS even comes out slightly slower than I3D (mean times in milliseconds):

I3D: 21.59845754835341
X3D_XS: 23.11600530412462
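
To rule out a problem with my hand-rolled timing loop, I suppose the measurement could also be cross-checked with torch.utils.benchmark, which handles warm-up and CUDA synchronization internally (a sketch only; this is not the code that produced the numbers above):

import torch
import torch.utils.benchmark as benchmark

device = torch.device("cuda:0")

# same model and dummy input as in the script above
model = torch.hub.load(
    "facebookresearch/pytorchvideo", "x3d_xs", pretrained=True
).eval().half().to(device)
input_data = torch.zeros([4, 3, 5, 224, 224], dtype=torch.float16, device=device)

timer = benchmark.Timer(stmt="model(x)", globals={"model": model, "x": input_data})
# runs the statement repeatedly and prints per-call timing statistics
print(timer.timeit(100))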

Am I doing something wrong, or are these results reasonable? Was my initial assumption about the expected inference speed of the two models wrong?