Hi, I’m trying to implement a network that feeds each frame of an arbitrary-length video into a CNN to generate a sequence of feature vectors, then feeds that sequence into an LSTM.

My first idea seems a bit hacky, so I thought I’d ask for opinions on it, or whether someone knows a better way.

```
import torch
# Two 1-channel videos of 3 and 4 frames respectively, shaped (channels, frames, H, W)
vid1 = torch.ones(1, 3, 3, 3)
vid2 = 2 * torch.ones(1, 4, 3, 3)
# A toy CNN that takes the average of every 2x2 square in a frame
cnn = torch.nn.Conv2d(1, 1, kernel_size=(2, 2), bias=False)
cnn.load_state_dict({'weight': torch.ones(1, 1, 2, 2) / 4})
# Now I can run each frame of a video through the CNN by moving the
# frame dimension into the batch dimension, for example
cnn(vid1.transpose(0, 1)).flatten(1)
# [out] tensor([[1., 1., 1., 1.],
# [1., 1., 1., 1.],
# [1., 1., 1., 1.]])
# but to run a batch containing, say, vid1 and vid2, I concatenate along the
# frame dimension and again fold the frames into the batch dimension
cnn(torch.cat([vid1, vid2], dim=1).transpose(0, 1)).flatten(1)
# [out] tensor([[1., 1., 1., 1.],
# [1., 1., 1., 1.],
# [1., 1., 1., 1.],
# [2., 2., 2., 2.],
# [2., 2., 2., 2.],
# [2., 2., 2., 2.],
# [2., 2., 2., 2.]])
# Now, with some difficulty, I could split this back up per video,
# make a PackedSequence, and feed that into an LSTM (rough sketch below)
```
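For concreteness, here is a rough sketch of that last step, continuing from the snippet above: split the stacked per-frame features back out by video length, pad, pack, and run the pack through an LSTM. The input size of 4 comes from the toy CNN’s flattened 2x2 output; the hidden size of 8 is just a placeholder.

```
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

lengths = [3, 4]  # frames in vid1 and vid2
feats = cnn(torch.cat([vid1, vid2], dim=1).transpose(0, 1)).flatten(1)  # (7, 4)
# Split the stacked features back into one sequence per video
seqs = feats.split(lengths, dim=0)  # shapes (3, 4) and (4, 4)
# Pad to (batch, max_len, features) and pack so the LSTM skips the padding
padded = pad_sequence(seqs, batch_first=True)
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = torch.nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)  # h_n: (1, 2, 8), one summary vector per video
```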

This idea seems not very elegant, so I’m open to suggestions. I’ve also thought about making the batch size 1, running one video at a time with the batch dimension used for frames, and only stepping the optimizer every `N` calls to `loss.backward()` to create an effective batch.
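In case that alternative isn’t clear, this is roughly the loop I have in mind (gradient accumulation); `model`, `criterion`, `optimizer`, and `loader` are placeholders for whatever video model, loss, optimizer, and single-video DataLoader end up being used:

```
N = 8  # effective batch size (placeholder)
optimizer.zero_grad()
for i, (video, label) in enumerate(loader):    # one video per iteration, batch dim = frames
    loss = criterion(model(video), label) / N  # scale so accumulated grads average over N
    loss.backward()                            # gradients accumulate in the .grad buffers
    if (i + 1) % N == 0:
        optimizer.step()
        optimizer.zero_grad()
```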

If anyone has any thoughts on this, I would greatly appreciate it. Thanks!

Here is a Colab notebook continuing the implementation.