Kinetics400 Dataset

The Kinetics400 VisionDataset seems very specific: for example, it returns audio and requires step_between_clips (which basically forces a sliding window over each video).
Quoting the docs:

To give an example, for 2 videos with 10 and 15 frames respectively, if frames_per_clip=5
and step_between_clips=5, the dataset size will be (2 + 3) = 5, where the first two
elements will come from video 1, and the next three elements from video 2.
Note that we drop clips which do not have exactly frames_per_clip elements, so not all
frames in a video might be present.
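
For concreteness, this is how I read the current behaviour (a rough sketch; the root path is a placeholder and I might be misunderstanding the API):

```python
import torchvision

# Stock dataset: the sliding window of clips is built over every video up front.
dataset = torchvision.datasets.Kinetics400(
    root="/path/to/kinetics/train",  # placeholder; expects class-subfolder layout
    frames_per_clip=5,
    step_between_clips=5,
)

# len(dataset) counts clips, not videos: with the 10- and 15-frame videos
# from the quote above this would be 2 + 3 = 5.
print(len(dataset))

# Each item is a (video, audio, label) tuple; video is a Tensor[T, H, W, C].
video, audio, label = dataset[0]
```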

The use case of taking the same two videos (10 and 15 frames) and, for a clip length of 5, returning one random temporal sample per video, e.g. frames [2,3,4,5,6] from video 1 and [11,12,13,14,15] from video 2, seems much more common (at least for i3d, r2+1d and similar models).
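
Concretely, per video I would like something along these lines (just a sketch; the helper name and paths are mine, and decoding the whole video only to keep 5 frames is obviously wasteful):

```python
import random

from torchvision.io import read_video

def random_clip(path, frames_per_clip=5):
    # video is a Tensor[T, H, W, C]; audio and metadata are ignored here
    video, _audio, _info = read_video(path, pts_unit="sec")
    num_frames = video.shape[0]
    if num_frames < frames_per_clip:
        raise ValueError(f"{path} has fewer than {frames_per_clip} frames")
    # Pick one random temporal offset, e.g. frames [2, 3, 4, 5, 6]
    start = random.randint(0, num_frames - frames_per_clip)
    return video[start : start + frames_per_clip]

clip = random_clip("/path/to/some_video.mp4")  # placeholder path
```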

Are there any examples of adapting the Kinetics400 data loader to this use case, i.e. randomly sampling one clip per video rather than doing a sliding window?
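
One thing that looks related is RandomClipSampler from torchvision.datasets.samplers (it appears to be what the video classification reference scripts use), but I am not sure whether combining it with Kinetics400 like this is the intended approach:

```python
import torch
import torchvision
from torchvision.datasets.samplers import RandomClipSampler

dataset = torchvision.datasets.Kinetics400(
    root="/path/to/kinetics/train",  # placeholder
    frames_per_clip=5,
    step_between_clips=1,            # still indexes every possible clip up front
)

# Draw at most one (random) clip per video each epoch.
sampler = RandomClipSampler(dataset.video_clips, max_clips_per_video=1)
loader = torch.utils.data.DataLoader(dataset, batch_size=8, sampler=sampler)
```

Even if that works, it still has to build the full sliding-window index first, which is part of what I would like to avoid.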

I have found this example using ffmpeg-python, but it would be great to be able to do this with torchvision: