Proper way to use 2D CNN with video input

Hello, I was wondering if the following is the right way to process a video with a 2D CNN:

batch_size, timesteps, C, H, W = x.size()
c_in = x.view(batch_size * timesteps, C, H, W)  # fold time into the batch dimension
c_out = self.cnn(c_in)
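To make the pattern concrete, here is a minimal self-contained sketch of the fold/unfold idea; the CNN below is just a placeholder, not the actual model from the post, and all sizes are made up:

```python
import torch
import torch.nn as nn

# Placeholder 2D CNN that maps each frame to a small feature vector.
cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),  # -> (N, 8)
)

x = torch.randn(2, 5, 3, 32, 32)                 # (batch, timesteps, C, H, W)
batch_size, timesteps, C, H, W = x.size()
c_in = x.view(batch_size * timesteps, C, H, W)   # fold time into the batch
c_out = cnn(c_in)                                # (batch * timesteps, feat)
c_out = c_out.view(batch_size, timesteps, -1)    # unfold back for a downstream RNN
print(c_out.shape)  # torch.Size([2, 5, 8])
```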

This seems to work perfectly with AlexNet, but when I use a bigger network (say GoogLeNet or ResNet), its performance degrades. Limited data is not the issue, as there are papers using such big networks with the same dataset. The only thing to blame, from my point of view, could be the fact that AlexNet doesn't have BatchNorm, whereas the others do.
(My batch_size = 1 and I don't use temporal padding.)


Are you processing c_out somehow afterwards?
Note that different models might use different names for the “last” layer, e.g. model.fc instead of model.classifier.
A batch size of 1 with batch norm layers is usually not a good idea, since the per-batch statistics computed from a single sample are essentially noise. You could try replacing the batch norm layers with e.g. instance norm, changing the momentum argument, removing them entirely, etc.
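A minimal sketch of the replacement idea, assuming a generic module tree (here GroupNorm, one per-sample alternative; InstanceNorm2d would be swapped in the same way, and the tiny stand-in network below is not any real model):

```python
import torch
import torch.nn as nn

def replace_bn(module: nn.Module) -> None:
    """Recursively swap BatchNorm2d for GroupNorm, which normalizes
    per sample and therefore works fine with batch_size = 1."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name,
                    nn.GroupNorm(num_groups=1, num_channels=child.num_features))
        else:
            replace_bn(child)

# Tiny stand-in network; with a real model you would call replace_bn(model).
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)
replace_bn(net)
out = net(torch.randn(1, 3, 8, 8))  # batch size 1 now normalizes sensibly
print(type(net[1]).__name__)        # GroupNorm
```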


Thank you for your reply,
c_out is then processed by an RNN (CNN + RNN + log_softmax, trained with CTC loss). As far as different layer names are concerned, that's definitely not the problem. I am going to try replacing the batch norm layers and I'll let you know.
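For reference, a minimal sketch of that pipeline (all sizes, layer choices, and the GRU are placeholders, not my actual model):

```python
import torch
import torch.nn as nn

class CNNRNN(nn.Module):
    """Per-frame 2D CNN -> RNN -> log_softmax, suitable for CTC loss."""
    def __init__(self, feat_dim=32, hidden=64, num_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):
        b, t, c, h, w = x.size()
        feats = self.cnn(x.view(b * t, c, h, w)).view(b, t, -1)
        out, _ = self.rnn(feats)
        # CTCLoss expects (T, N, C) log-probabilities
        return self.fc(out).log_softmax(-1).permute(1, 0, 2)

model = CNNRNN()
x = torch.randn(1, 6, 3, 32, 32)          # one clip of 6 frames, batch_size = 1
log_probs = model(x)                      # (T=6, N=1, C=10)
targets = torch.tensor([[1, 2, 3]])       # dummy label sequence (0 is blank)
loss = nn.CTCLoss()(log_probs, targets,
                    input_lengths=torch.tensor([6]),
                    target_lengths=torch.tensor([3]))
print(log_probs.shape)
```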
Thanks again.