AlexNet vs ResNet efficiency as feature extractor

I’ve been using a CNN-LSTM network for video description, but if I swap the CNN backbone from AlexNet to any supposedly better architecture, the results get worse! I ran experiments where I kept a few CNN layers frozen, left the network end-to-end trainable, or froze everything except the last CNN layer, but AlexNet still yielded the best results.
Since I’m dealing with videos, I pass one video at a time, and to push its frames through the CNN I do the following:

```python
bs, frames, channels, height, width = x.size()
x = x.view(bs * frames, channels, height, width)
```

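For context, here is a minimal self-contained sketch of that pattern: frames are folded into the batch dimension, run through the CNN, then unfolded back to `(batch, frames, feat_dim)` for the LSTM. The tiny `cnn` below is a hypothetical stand-in for the real backbone (AlexNet, ResNet, etc.):

```python
import torch
import torch.nn as nn

# Hypothetical tiny CNN standing in for the real backbone.
cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

x = torch.randn(2, 5, 3, 32, 32)  # (batch, frames, channels, height, width)
bs, frames, channels, height, width = x.size()

# Fold frames into the batch dimension, extract features per frame...
feats = cnn(x.view(bs * frames, channels, height, width))

# ...then unfold back to (batch, frames, feat_dim) for the LSTM.
feats = feats.view(bs, frames, -1)
print(feats.shape)  # torch.Size([2, 5, 8])
```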
Could the fault be that the number of frames differs from video to video, so the BatchNorm layers that the other architectures have mess things up?
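Not an answer to the question, but for anyone reading: one commonly used workaround for this suspicion is to keep the BatchNorm layers in eval mode while fine-tuning, so they use the pretrained running statistics rather than per-video batch statistics computed over highly correlated frames. A minimal sketch, with `backbone` as a hypothetical stand-in for the real CNN:

```python
import torch.nn as nn

# Hypothetical stand-in backbone containing a BatchNorm layer.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

def freeze_batchnorm(model: nn.Module) -> None:
    """Put every BatchNorm layer into eval mode and stop its gradients."""
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.eval()                       # use running stats, not batch stats
            for p in m.parameters():
                p.requires_grad_(False)    # freeze the affine parameters too

backbone.train()            # model.train() flips BN back to batch statistics...
freeze_batchnorm(backbone)  # ...so re-freeze BN after every .train() call
```

Note that calling `.train()` on the model re-enables batch statistics, so the freeze has to be reapplied after each such call (e.g. at the start of every epoch).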
Thanks in advance 🙂


In the context of object tracking, the best feature extractor for Siamese networks was, until recently, AlexNet. This paper sheds light on why standard ResNets worsen feature-extractor performance there, and the authors propose a solution to the problem.

I’m not sure if it will solve your case, but it is worth taking a closer look.