The Kinetics400 VisionDataset seems very specific. For example, it returns audio and requires step_between_clips (basically forcing a sliding window).
Quoting the documentation:

"To give an example, for 2 videos with 10 and 15 frames respectively, if frames_per_clip=5 and step_between_clips=5, the dataset size will be (2 + 3) = 5, where the first two elements will come from video 1, and the next three elements from video 2. Note that we drop clips which do not have exactly frames_per_clip elements, so not all frames in a video might be present."
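To make the quoted arithmetic concrete, here is a minimal sketch of the clip-count rule the docs describe (partial clips dropped). This is just the arithmetic, not torchvision's actual implementation:

```python
def num_clips(num_frames, frames_per_clip, step_between_clips):
    """Count full sliding-window clips; clips shorter than
    frames_per_clip are dropped, matching the quoted behavior."""
    if num_frames < frames_per_clip:
        return 0
    return (num_frames - frames_per_clip) // step_between_clips + 1

# Two videos with 10 and 15 frames, frames_per_clip=5, step_between_clips=5:
sizes = [num_clips(n, 5, 5) for n in (10, 15)]  # -> [2, 3], total dataset size 5
```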
The use case of taking two videos with 10 and 15 frames respectively and returning (assuming a clip length of 5) a random temporal sample, e.g. frames [2,3,4,5,6] from video 1 and [11,12,13,14,15] from video 2, seems much more common (at least for I3D, R(2+1)D, etc. models).
Are there any examples of converting the Kinetics data loader to work this way, i.e. randomly sampling one clip per video rather than doing a sliding window?
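For reference, the random-sampling behavior I'm describing can be sketched in plain Python (independent of torchvision; `sample_clip_indices` is a hypothetical helper, not a library function):

```python
import random

def sample_clip_indices(num_frames, frames_per_clip, rng=random):
    """Pick one random contiguous clip of frames_per_clip frames
    from a video, instead of enumerating sliding-window clips.

    Returns a list of frame indices, or None if the video is too short.
    """
    if num_frames < frames_per_clip:
        return None
    start = rng.randrange(num_frames - frames_per_clip + 1)
    return list(range(start, start + frames_per_clip))

# One random 5-frame clip per video for videos of 10 and 15 frames:
clips = [sample_clip_indices(n, 5) for n in (10, 15)]
```

As far as I can tell, torchvision also ships a RandomClipSampler (in torchvision.datasets.samplers) that limits the number of clips drawn per video, which may cover this use case when combined with the existing sliding-window dataset.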
I have found this example using ffmpeg-python; it would be great to do the same with torchvision: