Confusion with CNN-LSTM implementation


Let’s say I have a video of 10 frames and I want to train a CNN-LSTM model with a ResNet backbone, for example. During training, do I use one identical ResNet on the 10 separate frames, OR 10 different ResNets, one for every frame?

  • Let’s suppose that I will use the weights of a ResNet trained on ImageNet as initialization.
  • The reason I’m asking this question is that I implemented a paper on time series where the model’s parameter count was T times the feature extractor’s parameters. But then I thought: if we want to use the, let’s say, ResNet as a fixed feature extractor, we use a single one instead of T!

Thank you


  • Identical ResNet
    Then you can make instances like this:
resnet1 = model()
resnet2 = model()
resnet3 = model()
  • Multi-Model
    Then you can make multiple different models:
resnet1 = model1()
resnet2 = model2()
resnet3 = model3()
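The parameter count makes the difference concrete: keeping T separate copies multiplies the backbone’s parameters by T (which matches the “T times the feature extractor parameters” the question mentions), while one shared instance applied to every frame does not. A toy sketch with a stand-in CNN (not a real ResNet, names are my own):

```python
import torch.nn as nn

def count_params(m):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in m.parameters())

def tiny_cnn():
    # Stand-in for a ResNet backbone, just to compare parameter counts.
    return nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten())

T = 10
separate = nn.ModuleList(tiny_cnn() for _ in range(T))  # one copy per frame
shared = tiny_cnn()                                     # one copy for all frames

print(count_params(separate) == T * count_params(shared))  # True
```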

Hi, I don’t think it’s possible or correct to train multiple models; it wouldn’t even fit in GPU memory.
You should train a single model where you extract the features for each frame and pass them through the LSTM.


You can view this as an encoder-decoder model in code, consisting of an encoder model (the CNN) and a decoder model (the LSTM).

As you want to extract visual features using the ResNet (CNN) model, I think it is better to keep only one ResNet model, because using a different ResNet model for each frame doesn’t help the training process generalize.