Let’s say I have a video of 10 frames and I want to train a CNN-LSTM model with a ResNet backbone, for example. During training, do I use one identical ResNet on the 10 separate frames, OR 10 different ResNets, one for every frame?
Notes:
Let’s suppose that I will use the weights of a ResNet trained on ImageNet as initialization.
The reason I’m asking is that I implemented a paper on time series where the model’s parameter count was T times the feature extractor’s parameter count. But then I thought: if we want to use the ResNet as a fixed feature extractor, we would use a single one instead of T of them!
Hi, I don’t think it’s possible or correct to train multiple models; it wouldn’t even fit in GPU memory.
You should train a single model where you extract the features for each frame and pass them through the LSTM.
Since you want to extract visual features using the ResNet (CNN), I think it is better to keep only one ResNet. Using a different ResNet for each frame does not help the training process generalize.