Jointly training a deep convolutional network that feeds into a recurrent one

I have seen this kind of setup in a lot of work. In my case it involves video datasets, but there may be other use cases as well. When I trained a recurrent network in the past, I ran into some peculiarities with backpropagation through time. In PyTorch, I had to retain the computational graph until the last frame of a clip and then release it:

if idx == (clip.size()[0]-1): 
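
The fragment above only shows the end-of-clip check. A minimal sketch of the pattern it belongs to might look like the following. Everything except the `idx`/`clip` check is an assumption made for illustration: the toy `nn.LSTMCell`, the tensor shapes, the MSE loss and the names `targets`, `losses` and `last_frame` are hypothetical and not taken from the original training code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy data: 8 frames, batch of 2, 16 features per frame.
clip = torch.randn(8, 2, 16)
targets = torch.randn(8, 2, 4)

rnn = nn.LSTMCell(16, 4)
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.01)
criterion = nn.MSELoss()

h = torch.zeros(2, 4)
c = torch.zeros(2, 4)
losses = []

optimizer.zero_grad()
for idx in range(clip.size(0)):
    h, c = rnn(clip[idx], (h, c))
    loss = criterion(h, targets[idx])
    losses.append(loss.item())

    last_frame = idx == (clip.size()[0] - 1)
    # Gradients accumulate across frames; the graph is kept alive
    # for every frame except the last one of the clip.
    loss.backward(retain_graph=not last_frame)

    if last_frame:
        optimizer.step()
        optimizer.zero_grad()
        # Cut the history so the next clip starts a fresh graph.
        h, c = h.detach(), c.detach()
```

In this sketch the memory held by the graph grows with the number of frames in the clip, which is the cost the question is concerned with.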

Now I want to feed a deep representation from a multi-layer CNN into an LSTM and train the two jointly. I expect that retaining the graph across a whole clip will put a heavy load on memory, since the CNN activations for every frame have to be kept as well. What is the best way to train the two jointly?
I also plan to experiment with an architecture that feeds the recurrent output back into a convolutional network (C->R->C), so I need a solution that is flexible enough to handle that as well.
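
To make the C->R->C layout concrete, here is a minimal sketch of such a stack. It is an illustration under assumptions, not a settled design: the class name `ConvRecConv`, all layer sizes, the per-frame classification target and the random data are hypothetical.

```python
import torch
import torch.nn as nn


class ConvRecConv(nn.Module):
    """Hypothetical C->R->C stack: per-frame CNN, LSTM over time, conv head."""

    def __init__(self, hidden=32, num_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),  # each frame becomes (8, 4, 4)
        )
        self.lstm = nn.LSTM(8 * 4 * 4, hidden, batch_first=True)
        self.head = nn.Conv1d(hidden, num_classes, kernel_size=3, padding=1)

    def forward(self, clip):
        b, t, c, h, w = clip.shape
        # Fold time into the batch dimension so the CNN sees single frames.
        feats = self.encoder(clip.reshape(b * t, c, h, w))
        feats = feats.reshape(b, t, -1)
        seq, _ = self.lstm(feats)               # (b, t, hidden)
        return self.head(seq.transpose(1, 2))   # (b, num_classes, t)


torch.manual_seed(0)
model = ConvRecConv()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

clip = torch.randn(2, 6, 3, 16, 16)    # (batch, frames, channels, H, W)
labels = torch.randint(0, 5, (2, 6))   # one made-up class label per frame

logits = model(clip)
loss = nn.functional.cross_entropy(logits, labels)

optimizer.zero_grad()
loss.backward()   # a single backward pass reaches encoder, LSTM and head
optimizer.step()
```

Because all three parts live in one `nn.Module` and share one optimizer, a single `loss.backward()` call per clip updates them together in this sketch. The activations of every frame are still held until that call, so memory still scales with clip length.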