LSTM batch training over one long video session

Hello Everyone! As the title says, I am having trouble coming up with an LSTM model for training over long video sessions. I have, say, K training videos, each about an hour long. For each video, I extract frames at a certain FPS (say 10 frames per second) and then obtain their corresponding 512-dimensional CNN codes from a pre-trained CNN. So for one such video I have 36000 frames, and thus a feature matrix of size 36000 x 512.
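For concreteness, here is roughly what my feature extraction looks like. ResNet-18 is just a stand-in for my pre-trained CNN (any backbone with 512-dimensional penultimate features would do), and the frame decoding/resizing/normalization is assumed to happen elsewhere:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# ResNet-18 is only an example of a backbone whose penultimate layer is 512-dim
backbone = models.resnet18(pretrained=True)
backbone.fc = nn.Identity()        # drop the classifier head, keep the 512-dim codes
backbone.eval()

@torch.no_grad()
def extract_codes(frames):         # frames: [T, 3, 224, 224], ImageNet-normalized
    return backbone(frames)        # -> [T, 512]

# At 10 FPS, a 1-hour video gives T = 36000, i.e. a 36000 x 512 feature matrix.
```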

Now, obviously, each frame is related to the previous one. I wish to train my LSTM one video at a time. That is, one epoch will consist of iterating over each of the K videos, and for each video I will pass its frames' features to the LSTM. I intend to pass the frames' features by constructing a 3D tensor of shape [batch_size, sequence_size, 512] along with the corresponding labels (a 2D tensor of shape [batch_size x sequence_size, 1]), i.e. a many-to-many model. To train the model, I have identified two possible ways.
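Here is a sketch of how I am chunking one video's feature matrix into such tensors (the chunk length of 100 and the binary labels are arbitrary placeholders on my part):

```python
import torch

# Stand-ins for one video's CNN codes and per-frame labels (binary, as a guess)
codes  = torch.randn(36000, 512)
labels = torch.randint(0, 2, (36000, 1))

seq_len = 100                                # chunk length: my arbitrary choice
num_chunks = codes.size(0) // seq_len        # 360 chunks of 100 frames each

# Each chunk becomes one [1, seq_len, 512] input for Method 1 below,
# or several chunks can be stacked along dim 0 into a larger batch (Method 2).
x = codes.view(num_chunks, seq_len, 512)     # [360, 100, 512]
y = labels.view(num_chunks, seq_len, 1)      # [360, 100, 1]
```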

Method 1: Set batch_size = 1, i.e. the tensor has shape [1, sequence_size, 512], and pass it to the LSTM initialized with hidden state (including the cell state) = None. As the LSTM unfolds over each frame in the sequence, the injection of the hidden state from the previous time step ensures continuity of learning. In addition, I can simply pass the hidden state obtained at the end of one batch (note that batch_size is 1) as the initial state of the next batch, ensuring that learning over the next batch takes the previous batch into account. You can think of it this way: a run of, say, 200 frames is broken into two batches, each with a sequence length of 100 frames, and learning over the 200 frames is equivalent to learning over one batch followed by the other.
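A minimal sketch of Method 1, assuming a single-layer LSTM with a linear head, BCE loss, and Adam (all placeholders of mine, not settled choices). The detach() at the end of each step keeps the hidden-state values while cutting the backprop graph at the chunk boundary:

```python
import torch
import torch.nn as nn

# Placeholder model and data; hidden_size, loss, and optimizer are my guesses
lstm = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)
head = nn.Linear(256, 1)                    # one label per frame (many-to-many)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()))

seq_len, num_chunks = 100, 360              # 36000 frames split into 360 chunks
x = torch.randn(num_chunks, seq_len, 512)   # stand-in for the real CNN codes
y = torch.randint(0, 2, (num_chunks, seq_len)).float()

hidden = None                               # start of the video: no state yet
for i in range(num_chunks):
    chunk, target = x[i:i+1], y[i:i+1]      # [1, seq_len, 512], [1, seq_len]
    out, hidden = lstm(chunk, hidden)       # state carried in from previous chunk
    loss = criterion(head(out).squeeze(-1), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Keep the state values but cut the graph, so backprop stays within one chunk
    hidden = tuple(h.detach() for h in hidden)
```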

Method 2: Set batch_size = N (where N > 1, say N = 10), i.e. the tensor has shape [10, sequence_size, 512]. Now I need to maintain the same continuity of learning across the elements of the batch (note that each element of the batch has shape [sequence_size, 512]). But since batch processing is parallelized, I am assuming that every element of the batch starts with the same initial hidden state of None, so learning won't be continuous. It is as if each element of the batch were independent of the others, but in my case they aren't: the first frame of each element in the batch is related to the last frame of the previous element.
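To make the concern concrete, here is what I believe the batched call does (shapes as in the PyTorch docs): the initial state has one slice per batch element, and the elements never see each other's state:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)

N, seq_len = 10, 100
batch = torch.randn(N, seq_len, 512)   # 10 consecutive chunks of the same video

# Passing None zero-initializes h0 and c0, each of shape [num_layers, N, 256].
# Slice [:, k, :] belongs to batch element k, and the elements run in parallel:
# element k never receives the final state of element k-1.
out, (h_n, c_n) = lstm(batch, None)
print(h_n.shape)                       # torch.Size([1, 10, 256])
```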

My question: how do I ensure that the hidden state obtained at the end of processing one element of the batch is passed on to the next element of the batch? If I can find a way to do that, then I can simply pass the hidden state obtained at the end of each batch to the next batch, and continuity of learning would be maintained. If there is no way, then I believe I will have to work with the slower Method 1 to ensure continuous learning. Please let me know your thoughts. Thanks!