I’ve been using a CNN-LSTM network for video description, but whenever I swap the CNN backbone from AlexNet to any other, supposedly better architecture, the results get worse! I ran experiments where I kept a few CNN layers frozen, left the network end-to-end trainable, or froze everything except the last CNN layer, but AlexNet still yielded the best results.
Since I’m dealing with videos, I pass one video at a time but in order to go through the CNN I do the following:
```python
bs, frames, channels, height, width = x.size()
x = x.view(bs * frames, channels, height, width)
```
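For context, here is a minimal, self-contained sketch of what I mean (a toy CNN with a BatchNorm layer standing in for my real backbone; `extract_features` is just an illustrative name, not my actual code):

```python
import torch
import torch.nn as nn

# Toy stand-in for a BatchNorm-equipped backbone (e.g. ResNet-style).
cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),  # -> (bs*frames, 8)
)

def extract_features(x):
    # x: (bs, frames, channels, height, width)
    bs, frames, c, h, w = x.size()
    x = x.view(bs * frames, c, h, w)   # fold frames into the batch dim
    feats = cnn(x)                     # frames act as the "batch" for BatchNorm
    return feats.view(bs, frames, -1)  # back to per-video feature sequences

video = torch.randn(1, 12, 3, 32, 32)  # one 12-frame video, bs = 1
print(extract_features(video).shape)   # torch.Size([1, 12, 8])
```

So with `bs = 1`, the "batch" that BatchNorm normalizes over is really just the frames of a single video.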
Could the problem be that the number of frames differs from video to video, so the BatchNorm layers that the other architectures have mess things up?
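If BatchNorm is the culprit, I guess one thing to try would be putting only the BatchNorm layers into eval mode so they keep using the pretrained running statistics instead of per-video batch statistics. A rough sketch of what I have in mind (not something I’ve verified fixes it):

```python
import torch.nn as nn

def freeze_batchnorm(model):
    # Force BatchNorm layers to use their pretrained running mean/var
    # and stop updating their affine parameters.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()
            m.weight.requires_grad_(False)
            m.bias.requires_grad_(False)
    return model
```

(As far as I understand, this would need to be re-applied after every `model.train()` call, since `train()` flips the BatchNorm layers back to training mode.)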
Thanks in advance