Hi PyTorch community,
I want to implement sequence classification of videos. So far I have been using a pretrained feature extractor to get a d-dimensional vector representation of each frame, and I pass the resulting sequence of frame vectors to an LSTM.
All this while I have been using sequence length = number of frames in the video, with a batch size of 1.
Since my videos have varying sequence lengths, I understand that I will have to pad them in some way, e.g. with the pad_sequence() function in torch.nn.utils.rnn.
I also understand that I will have to arrange the data as seq_length x batch_size x (…), and that doing so involves a trade-off between generalizability and memory.
My question is: which would be better, padding with zero vectors, or repeating randomly picked frames from the video until the total length equals max_length?
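For reference, here is a minimal sketch of the zero-padding option as I understand it (the feature dim d and hidden size are placeholder values): pad_sequence zero-pads the batch, and pack_padded_sequence then tells the LSTM the true lengths so the padded steps are skipped.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

d = 16  # placeholder feature dim from the pretrained extractor

# three "videos" of 10, 7, and 4 frames, each frame a d-dim feature vector
seqs = [torch.randn(n_frames, d) for n_frames in (10, 7, 4)]
lengths = torch.tensor([s.size(0) for s in seqs])

# zero-pad to the max length; default batch_first=False -> (seq_len, batch, d)
padded = pad_sequence(seqs)  # shape (10, 3, 16)

# pack with the true lengths so the LSTM never sees the padded steps
# (sequences are already sorted longest-first, so enforce_sorted=True is fine)
packed = pack_padded_sequence(padded, lengths, enforce_sorted=True)

lstm = torch.nn.LSTM(input_size=d, hidden_size=32)
_, (h_n, _) = lstm(packed)  # h_n: (1, 3, 32), hidden state at each video's last real frame
```

With packing, the zero vectors never influence the final hidden state, since h_n is taken at each sequence's true last step.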