Which is better: 3D Conv or CNN+LSTM?

I am working on a project where I am trying to process video data, and I would like to know which is better to use: 3D Conv or CNN+LSTM?

If you have enough data and a good GPU, I would suggest using 3D conv + LSTM.

Do you mean 2D conv + LSTM? Please correct me if I am wrong.
Can I not use 3D conv alone to extract features from the videos and then use those features for classification?
Apart from the reasons mentioned, I would like to know at what level the two differ (for example, better feature extraction at the spatial and temporal levels, or something else).

No, I mean 3D conv with LSTM.
Yes, you can use 3D conv to extract features directly. Convolution is good at recognizing spatial patterns; for example, a conv layer applied to pictures can recognize edges, curves, and so on. An LSTM is good at recognizing temporal information: it can remember information from before the current time step. For example, LSTMs are used in machine translation because words later in a sentence can be related to words at the beginning of the sentence.
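
To make the 3D conv + LSTM idea concrete, here is a rough PyTorch sketch. It is only an illustration under my own assumptions, not a tuned model: the class name Conv3dLSTM, the layer sizes, and the clip shape are all made up for the example. The 3D conv layers pick up short spatio-temporal patterns, and the LSTM then reads the per-frame features in order.

```python
import torch
import torch.nn as nn


class Conv3dLSTM(nn.Module):
    """Hypothetical 3D conv + LSTM video classifier (illustrative sizes only)."""

    def __init__(self, num_classes=10, hidden_size=64):
        super().__init__()
        # 3D convolutions look at small neighborhoods in time, height, and width.
        # Pooling is spatial only (kernel 1 along time), so the number of
        # frames is preserved and each frame becomes one LSTM time step.
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # collapse H and W, keep T
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, video):
        # video: (batch, channels, frames, height, width)
        x = self.features(video)           # (B, 32, T, 1, 1)
        x = x.flatten(2).transpose(1, 2)   # (B, T, 32): one feature vector per frame
        _, (h_n, _) = self.lstm(x)         # h_n: (num_layers, B, hidden_size)
        return self.classifier(h_n[-1])    # (B, num_classes)


model = Conv3dLSTM(num_classes=5)
clip = torch.randn(2, 3, 8, 32, 32)  # 2 random clips, 8 RGB frames of 32x32
logits = model(clip)
print(logits.shape)  # torch.Size([2, 5])
```

If you want the 3D-conv-only variant from the question, one option is to drop the LSTM and average the features over the time axis before the linear layer. Which of the two works better will depend on your data, so it is worth trying both.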

Okay, thanks, got it :)