Hi, I’m trying to implement a network that feeds each frame of an arbitrary-length video into a CNN to generate a sequence of feature vectors, then feeds that sequence into an LSTM.

My first idea seems a bit hacky, so I thought I’d ask for opinions on it, or whether someone knows a better way.

```
import torch
# Two 1-channel videos of 3 and 4 frames respectively, shaped (channels, frames, H, W)
vid1 = torch.ones(1, 3, 3, 3)
vid2 = 2 * torch.ones(1, 4, 3, 3)
# A toy CNN that takes the average of every 2x2 square in a frame
cnn = torch.nn.Conv2d(1, 1, kernel_size=(2, 2), bias=False)
cnn.load_state_dict({'weight': torch.ones(1, 1, 2, 2) / 4})
# Now I can run each frame of a video through the CNN by moving the
# frame dimension into the batch dimension, for example
cnn(vid1.transpose(0, 1)).flatten(1)
# [out] tensor([[1., 1., 1., 1.],
# [1., 1., 1., 1.],
# [1., 1., 1., 1.]])
# but to run a batch containing, say, vid1 and vid2, I concatenate along the
# frame dimension and again fold the frames into the batch dimension
cnn(torch.cat([vid1, vid2], dim=1).transpose(0, 1)).flatten(1)
# [out] tensor([[1., 1., 1., 1.],
# [1., 1., 1., 1.],
# [1., 1., 1., 1.],
# [2., 2., 2., 2.],
# [2., 2., 2., 2.],
# [2., 2., 2., 2.],
# [2., 2., 2., 2.]])
# Now, with some difficulty, I could split this back up per video,
# make a PackedSequence, and feed that into an LSTM (rough sketch below)
```
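For concreteness, here is a rough sketch of that last step, continuing from the snippet above: split the stacked per-frame features back out by video length, pad, pack, and run the pack through an LSTM. The input size of 4 comes from the toy CNN’s flattened 2x2 output; the hidden size of 8 is just a placeholder.

```
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

lengths = [3, 4]  # frames in vid1 and vid2
feats = cnn(torch.cat([vid1, vid2], dim=1).transpose(0, 1)).flatten(1)  # (7, 4)
# Split the stacked features back into one sequence per video
seqs = feats.split(lengths, dim=0)  # shapes (3, 4) and (4, 4)
# Pad to (batch, max_len, features) and pack so the LSTM skips the padding
padded = pad_sequence(seqs, batch_first=True)
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = torch.nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)  # h_n: (1, 2, 8), one summary vector per video
```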

This idea seems not very elegant, so I’m open to suggestions. I’ve also thought about making the batch size 1, running one video at a time with the batch dimension used for frames, and only stepping the optimizer every `N` calls to `loss.backward()` to create an effective batch.
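In case that alternative isn’t clear, this is roughly the loop I have in mind (gradient accumulation); `model`, `criterion`, `optimizer`, and `loader` are placeholders for whatever video model, loss, optimizer, and single-video DataLoader end up being used:

```
N = 8  # effective batch size (placeholder)
optimizer.zero_grad()
for i, (video, label) in enumerate(loader):    # one video per iteration, batch dim = frames
    loss = criterion(model(video), label) / N  # scale so accumulated grads average over N
    loss.backward()                            # gradients accumulate in the .grad buffers
    if (i + 1) % N == 0:
        optimizer.step()
        optimizer.zero_grad()
```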

If anyone has any thoughts on this, I would greatly appreciate it. Thanks!

Here is a Colab notebook continuing the implementation.