Video dataloader - recover video labels

Luca · December 12, 2019, 5:07pm

I need to extract features from the HMDB51/UCF101 datasets for a video classification task using a pretrained 3D CNN. From what I understand, the video datasets available in PyTorch (torchvision) divide each video into a number of subclips (which I cannot set), separated by x frames (which I can set), and each subclip is made up of a fixed number of frames (which I can also set).
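To make that description concrete, here is a minimal sketch of sliding-window clip extraction in plain Python. It assumes clips are simply windows of `frames_per_clip` frames whose start positions are `step_between_clips` frames apart; `count_clips` and `clip_starts` are hypothetical helpers written for illustration, not torchvision functions, and the sketch ignores any frame-rate resampling the library may apply.

```python
def count_clips(num_frames: int, frames_per_clip: int, step_between_clips: int) -> int:
    """Count full windows of frames_per_clip frames, started every step_between_clips frames."""
    if num_frames < frames_per_clip:
        # Assumption: a video too short for one full window yields no clip at all.
        return 0
    return (num_frames - frames_per_clip) // step_between_clips + 1


def clip_starts(num_frames: int, frames_per_clip: int, step_between_clips: int) -> list:
    """Return the index of the first frame of every window."""
    n_clips = count_clips(num_frames, frames_per_clip, step_between_clips)
    return [i * step_between_clips for i in range(n_clips)]


# Toy numbers using the settings from this post (32 frames per clip, step of 50).
print(clip_starts(100, 32, 50))  # windows starting at frames 0 and 50
print(count_clips(20, 32, 50))   # shorter than one clip
```

Under this assumption the number of clips per video follows from the video length, which would explain why it cannot be set directly.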

Let’s assume that I use the following code:

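import torchvision.transforms as transforms
from torchvision.datasets import HMDB51
# Assumption: T is transforms.py from torchvision's references/video_classification
# folder, copied next to this script (it provides ToFloatTensorInZeroOne, Resize, Normalize).
import transforms as T

root = "<path_to_videos>"
annotation_path = "<path_to_annotations>"
frames_per_clip = 32
step_between_clips = 50
fold = 1
num_workers = 12
norm_value = 255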


normalize = T.Normalize(mean=[114.7748 / norm_value, 107.7354 / norm_value, 99.4750 / norm_value],
                        std=[0.22803, 0.22145, 0.216989])
height, width = 224, 224
transform_test = transforms.Compose([
    T.ToFloatTensorInZeroOne(),
    T.Resize((height, width)),
    normalize
])

dataset_test = HMDB51(root, annotation_path, frames_per_clip, step_between_clips=step_between_clips,
                      fold=fold, train=False, transform=transform_test, num_workers=num_workers)

The above code produces 2528 datapoints (clips) for test split 1 of HMDB51. I would like to average the predictions for the clips belonging to the same video, so that I can measure the accuracy of my classifier at the video level and not only at the clip level.
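The video-level averaging I have in mind could look roughly like the sketch below. It is plain Python on toy data: `average_by_video` and `video_accuracy` are hypothetical helpers, the scores are made up, and it assumes I already have one video index per clip plus a label for every video index.

```python
from collections import defaultdict


def average_by_video(video_indices, clip_scores):
    """Average per-clip class scores over all clips that share a video index."""
    sums = {}
    counts = defaultdict(int)
    for vid, scores in zip(video_indices, clip_scores):
        if vid not in sums:
            sums[vid] = [0.0] * len(scores)
        sums[vid] = [total + s for total, s in zip(sums[vid], scores)]
        counts[vid] += 1
    return {vid: [t / counts[vid] for t in total] for vid, total in sums.items()}


def video_accuracy(video_scores, video_labels):
    """Fraction of videos whose averaged scores peak at the true label."""
    correct = 0
    for vid, scores in video_scores.items():
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == video_labels[vid])
    return correct / len(video_scores)


# Toy data: video 0 has two clips, video 1 has one clip; two classes.
video_indices = [0, 0, 1]
clip_scores = [[0.2, 0.8], [0.6, 0.4], [0.9, 0.1]]
video_labels = {0: 1, 1: 0}  # hypothetical mapping: video index -> class label

video_scores = average_by_video(video_indices, clip_scores)
print(video_accuracy(video_scores, video_labels))
```

In the toy data the second clip of video 0 is misclassified on its own, but the averaged scores still pick the right class, which is the effect I want to measure.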

To do so, I thought about using dataset_test.video_clips.get_clip_location() to get the video indices, and then assigning the labels in order according to the video index. The loader is not shuffled, so the video with index 0 gets the first label, the next one gets the second, and so on.
While doing this I noticed that some videos are missing. Suppose I try this:

for i in range(300,303):
    print(i, dataset_test.video_clips.get_clip_location(i))

This gives me

300 (137, 0)
301 (138, 0)
302 (141, 0)

Where are videos 139 and 140? Moreover, if I call dataset_test.video_clips.cumulative_sizes[137:142], I get [301, 302, 302, 302, 303]. Why are those three numbers equal?
Also, what happens if a video is shorter than the frames_per_clip I selected? Will the video be ignored?
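One reading that would be consistent with the numbers above is sketched below. It is a guess at the mechanism, not confirmed torchvision behaviour: it assumes `cumulative_sizes` is a running total of clips per video and that a clip index is mapped to a video by binary search. `clip_location` is a hypothetical stand-in for `get_clip_location`, and the clip counts are toy values.

```python
from bisect import bisect_right
from itertools import accumulate


def clip_location(cumulative_sizes, clip_idx):
    """Map a global clip index to (video index, clip index within that video)."""
    # The first video whose running total exceeds clip_idx owns the clip.
    video_idx = bisect_right(cumulative_sizes, clip_idx)
    first_clip = cumulative_sizes[video_idx - 1] if video_idx > 0 else 0
    return video_idx, clip_idx - first_clip


# Toy clip counts: videos 2 and 3 contribute zero clips (assumed too short).
clips_per_video = [2, 1, 0, 0, 1]
cumulative_sizes = list(accumulate(clips_per_video))
print(cumulative_sizes)  # repeated totals mark the videos with zero clips

for i in range(cumulative_sizes[-1]):
    print(i, clip_location(cumulative_sizes, i))
```

If this is what happens, a video with no clips never shows up in the clip index at all, so labels would have to be looked up by video index rather than assigned in order.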
Thanks in advance for the help!