How do I train a CNN/LSTM combination with long sequences?

Tough question to frame in one line, but here is the context:

I have an LSTM running on variable-length image sequences, and a CNN generates the inputs to the LSTM. I want to train the CNN on the fly (jointly with the LSTM), rather than use it just as a frozen embedding generator.

To do this, I take my (batch_size, seq_length, *image_dims) tensor and view it as one big batch of images of shape (batch_size * seq_length, *image_dims). After it comes out of the CNN, I unpack it back into sequences before feeding it to the LSTM.
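
For concreteness, here is a minimal sketch of what I mean (the CNN and LSTM here are toy stand-ins just so the snippet runs; the real ones are my actual model):

```python
import torch
import torch.nn as nn

# Toy stand-ins so the snippet is self-contained; the real CNN/LSTM come from my model.
cnn = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(2, 5, 3, 64, 64)      # (batch_size, seq_length, C, H, W)
b, t = x.shape[:2]

flat = x.view(b * t, *x.shape[2:])    # fold the sequence dim into one big image batch
feats = cnn(flat)                     # (b * t, feature_dim)
seq = feats.view(b, t, -1)            # unfold back into sequences for the LSTM
out, _ = lstm(seq)                    # (b, t, hidden_size)
```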

This works just fine, except that some of the sequences are too long to fit in memory for the CNN (because the CNN’s activations also have to be kept around for backprop). I’d love some feedback on how to solve this. Here’s what I’ve been thinking about:

  • Break the “big batch” into sub-batches: backprop through the LSTM alone first, then backprop through the CNN one sub-batch at a time. This seems like a lot of work and overcomplicates things, and I’m not sure it would even help, since I’d still need to store the intermediate activations for everything.

  • Randomly select at most 256 samples from the “big batch” and run only those through the CNN with gradients enabled; everything else goes through with no grad (sketched below). That way the CNN isn’t learning from every single image in the sequence at once, but at least I can get all the inputs through to the LSTM without exhausting memory.
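
Here’s a rough sketch of what I have in mind for that second option. The helper name and the 256 cap are just placeholders, and I’m assuming that splicing grad-enabled outputs back into the no-grad features via index assignment propagates gradients the way I expect:

```python
import torch

def cnn_features_partial_grad(cnn, images, max_grad_samples=256):
    """images: (batch_size * seq_length, *image_dims).
    Runs every image through the CNN, but only builds the autograd graph
    for a random subset so the whole sequence fits in memory."""
    n = images.size(0)

    # Forward pass over all frames without storing activations for backprop.
    with torch.no_grad():
        feats = cnn(images)

    # Recompute a random subset with gradients enabled and splice those
    # features back in; the CNN only gets gradient signal from these frames.
    k = min(max_grad_samples, n)
    idx = torch.randperm(n, device=images.device)[:k]
    feats = feats.clone()              # keep the no-grad output intact
    feats[idx] = cnn(images[idx])      # differentiable index assignment

    return feats
```

The result would then be viewed back as (batch_size, seq_length, -1) and fed to the LSTM as before: loss.backward() updates the LSTM on the full sequence, while the CNN only accumulates gradients from the sampled frames (at the cost of a second forward pass for those frames).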

Any ideas or guidance would be appreciated!

@Alexander_Soare, did you figure out what strategy worked for you? Could you share some insights? Also, how much GPU memory do you have? Thanks!

@ekmungi I think I was working with an RTX 3090 with 24 GB of memory. I don’t think I ever solved this problem, as I moved on to using a transformer instead.