CNN layer on top of LSTM

Hello, I have a question about how to think about LSTMs. I have a video classification task. In most published work, people use a CNN to extract features, for example VGG16. The extracted features are then fed forward through an LSTM layer; say we care about 10 frames, so the network outputs one prediction after 10 frames. So the pipeline works like this: the CNN extracts features from 10 frames (which should give 4096 x 10), the features are concatenated and fed into the LSTM. If we follow this scenario, we must train the CNN separately first, although most people use a pretrained model. Then we train the LSTM on the previously extracted features. This leads to the problem that I cannot backpropagate into my CNN during LSTM training.

My question is: how can I do this in PyTorch, putting a CNN followed by an LSTM, so that during LSTM training I can also update the weights of / backpropagate into my CNN?


Given the setup you described, I believe all you need to do is to load a pre-trained VGG network using torchvision.models.vgg and feed the output features to the RNN (LSTM, etc.). Autograd will take care of back-propagating the error through the VGG network. Unless I’m completely missing something, it’s really as simple as that.
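For concreteness, here is a minimal sketch of what that could look like (the class name and hyperparameters are my own illustration, not a canonical recipe): a pretrained VGG16 backbone whose per-frame 4096-d features are fed to an LSTM, with everything trainable end to end.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNLSTM(nn.Module):
    """Hypothetical sketch: VGG16 feature extractor + LSTM, trained jointly."""
    def __init__(self, hidden_size=256, num_classes=10):
        super().__init__()
        vgg = models.vgg16(pretrained=True)
        self.features = vgg.features                          # conv layers
        self.fc = nn.Sequential(*list(vgg.classifier)[:-1])   # up to the 4096-d layer
        self.lstm = nn.LSTM(input_size=4096, hidden_size=hidden_size,
                            batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, time, 3, 224, 224)
        b, t = x.size(0), x.size(1)
        feats = []
        for i in range(t):                  # one CNN pass per frame
            f = self.features(x[:, i])      # (batch, 512, 7, 7)
            f = self.fc(f.view(b, -1))      # (batch, 4096)
            feats.append(f)
        feats = torch.stack(feats, dim=1)   # (batch, time, 4096)
        out, _ = self.lstm(feats)
        return self.classifier(out[:, -1])  # one prediction after the last frame
```

Because the VGG parameters are ordinary module parameters here, calling `.backward()` on the loss produces gradients for them as well as for the LSTM; if you instead want the fixed-extractor behaviour, you can freeze them with `requires_grad_(False)`.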

@ndronen Then what about the sequence length of the LSTM? Do I need a for loop over every image (in this case 10 frames) before feeding into the LSTM? And if I use a for loop, what about backpropagation? Because in most cases people train only the LSTM side (correct me if I am wrong); the CNN is just a feature generator, fixed with pretrained weights.

Yes, in PyTorch, because of autograd, a for-loop is completely compatible with backpropagation. Autograd builds the computational graph dynamically, recording operations and their predecessors as they occur, so backpropagation will just work. I included a link to an autograd tutorial in my previous response.
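To see this concretely, here is a tiny self-contained check (the toy sizes are arbitrary, chosen only for illustration) showing that gradients do reach a conv layer that is applied inside a per-frame for-loop:

```python
import torch
import torch.nn as nn

cnn = nn.Conv2d(3, 8, kernel_size=3, padding=1)
lstm = nn.LSTM(input_size=8 * 16 * 16, hidden_size=32, batch_first=True)

x = torch.randn(2, 10, 3, 16, 16)                     # (batch, time, C, H, W)
feats = [cnn(x[:, i]).flatten(1) for i in range(10)]  # loop over the 10 frames
out, _ = lstm(torch.stack(feats, dim=1))              # (batch, time, hidden)
loss = out[:, -1].sum()                               # dummy loss on the last step
loss.backward()

print(cnn.weight.grad is not None)  # True: autograd recorded the loop
```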


@ndronen @smth Do you think it should be like this? Suppose that I want to update the weights of the CNN during LSTM training.

Because some people said it should be like this (as done in Keras, for example):
[image: architecture diagram]

Is there something like TimeDistributed in PyTorch? Or shouldn't we do it that way anymore, because figure 1 is correct?

@ndronen @smth If I do it like in the figure above, how can I take every image that needs to be processed (in this example, [w x h x 3] x 10)? And most importantly, how can I append the tensors?
Will it backprop once or ten times through the CNN when training the LSTM?

Did you ever get this working? If so, can you share your code? I want to do a similar thing.

I am not entirely sure whether I implemented it right or wrong, but you can add your CNN model before the LSTM in the forward function. However, it will be very slow during training. That's why I'm not doing it anymore: training them separately seems to give roughly the same accuracy and is faster. Make sure you slice the tensor, run the CNN on each slice, and then concatenate the results to feed into the LSTM. I am still trying to find my previous code, but it seems it was deleted when I formatted my disk. Sorry.

There is no TimeDistributed layer in PyTorch, but I found on the PyTorch forum that someone has created one. A CNN layer and an LSTM layer cannot be connected directly, but PyTorch does have 3D convolutions (nn.Conv3d, as used in the C3D architecture), which can be used for video classification.
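For reference, the usual way people reproduce TimeDistributed-like behaviour in PyTorch is to fold the time dimension into the batch dimension, apply the module, and unfold again. A minimal sketch of such a wrapper (my own illustration, not an official layer):

```python
import torch.nn as nn

class TimeDistributed(nn.Module):
    """Applies `module` to every time step of a (batch, time, ...) tensor."""
    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, x):
        b, t = x.size(0), x.size(1)
        y = self.module(x.reshape(b * t, *x.shape[2:]))  # merge batch and time
        return y.reshape(b, t, *y.shape[1:])             # split them back apart

# Usage sketch: wrap a conv so it runs on all frames in one forward pass.
# td = TimeDistributed(nn.Conv2d(3, 8, kernel_size=3, padding=1))
# td(x) maps (batch, time, 3, H, W) -> (batch, time, 8, H, W)
```

This processes all frames in a single CNN forward pass instead of a Python loop, which is typically faster; autograd handles the reshapes the same way it handles the loop.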

Sorry, what do you mean by this: "Make sure you slice the tensor, run the CNN on each slice, and then concatenate the results to feed into the LSTM"? I cannot understand it. Do you mean slice the CNN's output, then concatenate it and feed it to the LSTM?