Let’s say I have a video of 10 frames and I want to train a CNN-LSTM model with a ResNet backbone, for example. During training, do I use one identical ResNet on the 10 separate frames, OR 10 different ResNets, one for every frame?
Notes:
Let’s suppose that I will use the weights of a ResNet trained on ImageNet as initialization.
The reason I’m asking is that I implemented a paper on time series where the model’s parameter count was T times the feature extractor’s parameter count. But then I thought: if we want to use the ResNet as a fixed feature extractor, we would use a single one instead of T of them!
Hi, I don’t think it’s possible or correct to train multiple models; it wouldn’t even fit in GPU memory.
You should train a single model where you extract the features for each frame and pass them through the LSTM.
Since you want to extract visual features using the ResNet (CNN), I think it is better to keep only one ResNet. Using a different ResNet for each frame does not help the training process generalize.