What is the usual way to sample frames for video classification?

Zhang_Chi · May 17, 2018, 3:04am

Hi,
It's my first time working in this area, and I have a naive question about video classification.
For a 3D conv model, one input sample has shape C×F×H×W, e.g. 3×16×256×256, which is a 16-frame clip from a video. My question is how to sample the 16 frames for training. My understanding is that frames are sampled uniformly: for example, for a video with 160 frames I can take one frame every 10 frames, and for a 1600-frame video one frame every 100 frames. However, this means the other 9 (or 99) frames in each interval are wasted. I can think of ways to make use of them, such as randomly sampling one frame within each interval of 10 (or 100) frames.
I don't know what the usual way to do this is. Thanks.
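
To make the two options in the question concrete, here is a minimal sketch in plain Python. The function name `sample_frame_indices` and its parameters are made up for illustration and do not come from any library. It splits the video into `clip_len` equal segments and takes either a random frame from each segment (training) or the middle frame of each segment (evaluation).

```python
import random


def sample_frame_indices(num_frames, clip_len=16, train=True, rng=None):
    """Return clip_len frame indices, one per equal-length segment.

    Hypothetical helper for illustration only.
    """
    rng = rng or random
    seg = num_frames / clip_len  # segment length, may be fractional
    indices = []
    for k in range(clip_len):
        # Offset inside segment k: random in [0, 1) when training,
        # the segment centre (0.5) otherwise.
        offset = rng.random() if train else 0.5
        idx = int((k + offset) * seg)
        # Clamp so the index never runs past the last frame.
        indices.append(min(idx, num_frames - 1))
    return indices


# Evaluation: one fixed frame per 10-frame segment of a 160-frame video.
print(sample_frame_indices(160, train=False))
# Training: a different random frame per 100-frame segment on each call.
print(sample_frame_indices(1600, train=True))
```

With random offsets, every frame of the video can eventually be seen over many epochs, while each clip still spans the whole video. If the video has fewer frames than `clip_len`, some indices simply repeat.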

fmassa · May 17, 2018, 11:41am

One alternative would be to have a look at https://github.com/NVIDIA/nvvl , which provides a PyTorch dataloader that uses the GPU for decoding the videos.

thecho7 · May 17, 2018, 12:46pm

One possible approach is to combine dense and sparse sampling.
That is, sample densely from the high-scoring parts of the video and, conversely, sample sparsely from the low-scoring parts.
However, this method can only be applied during training.
At CVPR (perhaps '15 or '16), MS used both single frames and sequences of frames to detect the highlights in a video. I recommend looking up that paper.
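
As a rough sketch of the dense/sparse idea above, assuming a positive per-frame score is already available (how the score is computed is not specified in this thread, so the `scores` list here is purely hypothetical): frames can be picked at evenly spaced quantiles of the cumulative score, which places more samples where the score is high. The function name `score_weighted_indices` is made up for illustration and is not taken from the paper mentioned.

```python
from bisect import bisect_right
from itertools import accumulate


def score_weighted_indices(scores, clip_len=16):
    """Pick clip_len frame indices, denser where scores are higher.

    `scores` is a hypothetical list with one positive number per frame.
    """
    cdf = list(accumulate(scores))  # running total of the scores
    total = cdf[-1]
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    indices = []
    for k in range(clip_len):
        # Evenly spaced targets in cumulative-score space.
        target = (k + 0.5) / clip_len * total
        # First frame whose cumulative score exceeds the target.
        idx = bisect_right(cdf, target)
        indices.append(min(idx, len(scores) - 1))
    return indices


# 20 frames: the first 10 have a low score, the last 10 a high score.
scores = [1] * 10 + [9] * 10
print(score_weighted_indices(scores, clip_len=10))
```

In this toy example most of the 10 picks land in the second, high-scoring half. A frame with a very large score can be picked more than once.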