Hello, I was wondering whether the following is the right way to process a video with a 2D CNN:
self.cnn = models.alexnet(pretrained=True)
self.cnn.classifier[-1] = Identity()
...
batch_size, timesteps, C, H, W = x.size()
c_in = x.view(batch_size * timesteps, C, H, W)
c_out = self.cnn(c_in)
This seems to work perfectly with AlexNet, but when I use a bigger network (say GoogLeNet or ResNet) its performance degrades. Limited data should not be the issue either, since there are papers that use such big networks on the same dataset. The only thing I can think of to blame is that AlexNet doesn't have BatchNorm, whilst the rest of them do.
(My batch_size is 1 and I don't use temporal padding.)
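A minimal sketch of one way to test the BatchNorm suspicion. With batch_size = 1, the view() above makes every BatchNorm layer compute its statistics over the frames of a single clip, which are highly correlated, so one experiment is to keep those layers in eval mode so that they use their stored running statistics instead. The `freeze_batchnorm` helper and the small `backbone` below are hypothetical stand-ins (not torchvision code) so the example runs without downloading weights. Whether this actually recovers the performance is an assumption to verify, not a guarantee.

```python
import torch
import torch.nn as nn


def freeze_batchnorm(model: nn.Module) -> nn.Module:
    """Put every BatchNorm layer in eval mode so it uses its running statistics."""
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()
    return model


# Hypothetical stand-in for a pretrained backbone that contains BatchNorm.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

backbone.train()            # train() switches BatchNorm back to batch statistics,
freeze_batchnorm(backbone)  # so the freeze must be re-applied after each train() call

# Same fold-time-into-batch pattern as in the question.
x = torch.randn(1, 16, 3, 64, 64)               # (batch_size, timesteps, C, H, W)
batch_size, timesteps, C, H, W = x.size()
c_in = x.view(batch_size * timesteps, C, H, W)  # frames become the batch dimension
c_out = backbone(c_in)
features = c_out.view(batch_size, timesteps, -1)  # back to one feature vector per frame
```

If the larger networks behave better with BatchNorm frozen like this, that would point to the per-clip batch statistics rather than to the reshaping itself.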