Let’s say I have a video of 10 frames and I want to train a CNN-LSTM model with, for example, a ResNet backbone. During training, do I use a single ResNet (shared weights) applied to each of the 10 frames, or 10 different ResNets, one per frame?
- Let’s suppose I initialize the ResNet with weights pretrained on ImageNet.
- The reason I’m asking is that I implemented a paper on time series in which the model’s parameter count was T times that of the feature extractor. But then I thought: if we want to use, say, a ResNet as a fixed feature extractor, we would use a single one instead of T of them!
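
To make the two options concrete, here is a minimal PyTorch sketch of the single shared backbone variant, next to a parameter count for T separate copies. Everything in it is an assumption for illustration: the class name `SharedBackboneCNNLSTM`, the helper `count_params`, the layer sizes, and the tiny conv stack that stands in for a ResNet so the example runs without downloading weights. It is not the architecture from the paper mentioned above.

```python
import copy

import torch
import torch.nn as nn


class SharedBackboneCNNLSTM(nn.Module):
    """One CNN applied to every frame, followed by an LSTM over time (hypothetical sketch)."""

    def __init__(self, feat_dim=32, hidden_dim=64, num_classes=5):
        super().__init__()
        # Tiny stand-in for a ResNet backbone (assumption, keeps the sketch light).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, video):
        # video has shape (batch, T, C, H, W).
        b, t, c, h, w = video.shape
        # Fold time into the batch axis so the same weights see every frame.
        frames = video.reshape(b * t, c, h, w)
        feats = self.backbone(frames)        # (b * t, feat_dim)
        feats = feats.reshape(b, t, -1)      # (b, t, feat_dim)
        out, _ = self.lstm(feats)            # (b, t, hidden_dim)
        # Classify from the LSTM output at the last time step.
        return self.head(out[:, -1])


def count_params(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


T = 10
model = SharedBackboneCNNLSTM()
video = torch.randn(2, T, 3, 32, 32)  # 2 random clips of T frames each
logits = model(video)

# Alternative design: T independent backbones, one per frame.
separate = nn.ModuleList([copy.deepcopy(model.backbone) for _ in range(T)])

print("shared backbone params:    ", count_params(model.backbone))
print("T separate backbone params:", count_params(separate))
```

With the shared backbone, the parameter count does not depend on T, while the `separate` variant holds exactly T times as many backbone parameters, which would match a model whose size is T times the feature extractor. To try this with a real backbone, one option is `torchvision.models.resnet18` with its `fc` layer replaced by `nn.Identity()` (512-dimensional features per frame), and calling `requires_grad_(False)` on its parameters if it should act as a fixed feature extractor.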