I’ve been testing the I3D and X3D_XS models from PyTorchVideo to classify short video sequences.
This table and a manual inspection of the models show that X3D_XS has roughly a tenth as many parameters as I3D (about 3M versus 30M).
Based on this, I expected X3D_XS to run inference much faster than I3D, especially since X3D_XS accepts sequences as short as 4 frames, whereas I3D only works with sequences of at least 8 frames. However, I found the opposite: I3D seems to run inference faster.
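For reference, the parameter comparison above can be reproduced with something like the following minimal sketch. `count_parameters` is a hypothetical helper of my own (not part of PyTorch or PyTorchVideo), and the toy model is only a stand-in so the snippet runs without downloading any weights.

```python
import torch.nn as nn


def count_parameters(model, trainable_only=False):
    """Return the total number of scalar parameters in a model."""
    return sum(
        p.numel()
        for p in model.parameters()
        if p.requires_grad or not trainable_only
    )


# Toy stand-in model (an assumption for illustration, not I3D or X3D_XS).
# Conv3d(3 -> 8, 3x3x3 kernel): 8 * 3 * 27 weights + 8 biases = 656
# Linear(8 -> 2): 8 * 2 weights + 2 biases = 18
toy = nn.Sequential(nn.Conv3d(3, 8, kernel_size=3), nn.Linear(8, 2))
print(count_parameters(toy))
```

Calling the same helper on the two models loaded via `torch.hub.load` should give the totals I am comparing.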
Here is an example of what I’ve been doing:
import numpy as np
import torch


def time_model(model, num_frames):
    times = []
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    # dummy input data
    input_data = (
        torch.tensor(np.zeros([4, 3, num_frames, 224, 224], dtype=np.float32))
        .half()
        .to(device)
    )
    for _ in range(100):
        start.record()
        _ = model(input_data)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    # check the mean inference time, remove first entries (no model warm-up)
    return np.mean(times[10:])


def main():
    i3d_model = (
        torch.hub.load("facebookresearch/pytorchvideo", "i3d_r50", pretrained=True)
        .half()
        .to(device)
    )
    x3d_xs_model = (
        torch.hub.load("facebookresearch/pytorchvideo", "x3d_xs", pretrained=True)
        .half()
        .to(device)
    )
    # timing I3D with 4 sequences of 8 frames
    avg_time_i3d = time_model(i3d_model, 8)
    # timing X3D_XS with 4 sequences of 5 frames
    avg_time_x3dxs = time_model(x3d_xs_model, 5)
    print(f"I3D: {avg_time_i3d}")
    print(f"X3D_XS: {avg_time_x3dxs}")


if __name__ == "__main__":
    if torch.cuda.is_available():
        device = "cuda:0"
    else:
        device = "cpu"
    device = torch.device(device)
    main()
In my tests (using an RTX 8000 GPU), X3D_XS is even slightly slower than I3D (mean time per forward pass, in milliseconds):
I3D: 21.59845754835341
X3D_XS: 23.11600530412462
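The two means above differ by only about 7%, so it may also be worth looking at the spread of the per-iteration timings rather than the mean alone. Below is a minimal sketch of how that could be done. `summarize_timings` is a hypothetical helper, and the numbers in `example` are made up purely to illustrate the call; they are not measurements.

```python
import statistics


def summarize_timings(times_ms, warmup=10):
    """Summarize per-iteration timings (ms) after dropping warm-up iterations."""
    steady = list(times_ms)[warmup:]
    if len(steady) < 2:
        raise ValueError("need at least two measurements after warm-up")
    return {
        "mean": statistics.mean(steady),
        "median": statistics.median(steady),
        "stdev": statistics.stdev(steady),  # sample standard deviation
        "min": min(steady),
        "max": max(steady),
    }


# Made-up timings: two slow warm-up iterations, then steady state.
example = [250.0, 90.0, 21.0, 22.0, 23.0, 22.0]
summary = summarize_timings(example, warmup=2)
print(summary)
```

With the `times` list collected in `time_model`, the same call with `warmup=10` would match the slice used there.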
Am I doing something wrong, or are these results reasonable? Was my initial assumption about the expected inference speed of the two models wrong?