I have a video classification task for which I extracted the frames and used AlexNet to classify them individually. I now think that approach was wrong, since it is just image classification and ignores the temporal information. I want to convert AlexNet to a 3D convnet so that I can combine it with an LSTM to do the classification.
Could you please suggest what changes I need to make to the existing AlexNet architecture?
I’m not sure how you would like to expand the shape of all parameters, which would be needed for a 3D model.
Would feeding each frame separately to the CNN and then forwarding these features to an RNN work instead?
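For reference, the per-frame alternative asked about here could be sketched roughly as follows. This is a minimal illustration, not a tuned model: the class name `FrameCNNLSTM`, the tiny CNN, and all layer sizes are assumptions made up for the example, and in practice the CNN would be something like the AlexNet feature extractor.

```python
import torch
import torch.nn as nn

class FrameCNNLSTM(nn.Module):
    """Hypothetical sketch: a 2D CNN applied to every frame, followed by an LSTM."""

    def __init__(self, num_classes: int, feat_dim: int = 32, hidden: int = 64):
        super().__init__()
        # Small stand-in CNN (assumption); it maps one frame to a feature vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: [batch, frames, channels, height, width]
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1))   # fold frames into the batch: [b*t, feat_dim]
        feats = feats.view(b, t, -1)           # restore the time axis: [b, t, feat_dim]
        _, (h_n, _) = self.lstm(feats)         # h_n: [num_layers, b, hidden]
        return self.fc(h_n[-1])                # classify from the last hidden state

model = FrameCNNLSTM(num_classes=5)
logits = model(torch.randn(2, 8, 3, 32, 32))
print(logits.shape)
```

Here the 2D CNN never sees more than one frame at a time, so all temporal modelling is left to the LSTM.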
I don’t think so. My advisor was suggesting a conv3d + lstm architecture.
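One possible reading of a conv3d + LSTM architecture is sketched below, assuming the 3D convolutions keep the time axis intact so that the LSTM still receives one feature vector per time step. The class name `Conv3dLSTM` and every layer size are hypothetical and chosen only to keep the example small.

```python
import torch
import torch.nn as nn

class Conv3dLSTM(nn.Module):
    """Hypothetical sketch: 3D conv features, then an LSTM over the time axis."""

    def __init__(self, num_classes: int, hidden: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            # Pool only over height and width so the number of time steps is kept.
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: [batch, channels, frames, height, width], the layout nn.Conv3d expects
        x = self.features(clip)        # [b, 32, t, h, w]
        x = x.mean(dim=(3, 4))         # average over space: [b, 32, t]
        x = x.transpose(1, 2)          # [b, t, 32] for a batch_first LSTM
        _, (h_n, _) = self.lstm(x)
        return self.fc(h_n[-1])

model = Conv3dLSTM(num_classes=5)
logits = model(torch.randn(2, 3, 8, 32, 32))
print(logits.shape)
```

Note that `nn.Conv3d` expects the channel dimension before the frame dimension, unlike the per-frame layout often used for CNN + RNN models.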
OK, in that case you could try to swap all 2d layers for 3d ones, e.g. nn.Conv2d for nn.Conv3d.
This approach would reinitialize the model with random parameters.
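A minimal sketch of such a swap is shown below. The helper `swap_2d_for_3d` is hypothetical (it is not a PyTorch function), it assumes square kernels and integer pooling arguments, and it is demonstrated on a small stand-in model rather than the real AlexNet. The new layers are freshly initialized, so no pretrained values survive.

```python
import torch
import torch.nn as nn

def swap_2d_for_3d(module: nn.Module) -> nn.Module:
    """Recursively replace Conv2d / MaxPool2d children with 3D layers (random init)."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Conv2d):
            # Assumption: square kernels, so the first entry is reused for all 3 dims.
            new_conv = nn.Conv3d(
                child.in_channels, child.out_channels,
                kernel_size=child.kernel_size[0],
                stride=child.stride[0],
                padding=child.padding[0],
                bias=child.bias is not None,
            )
            setattr(module, name, new_conv)
        elif isinstance(child, nn.MaxPool2d):
            # Assumption: kernel_size, stride and padding were given as ints.
            setattr(module, name,
                    nn.MaxPool3d(child.kernel_size, child.stride, child.padding))
        else:
            swap_2d_for_3d(child)
    return module

# Small AlexNet-style stand-in (sizes are made up for the example).
features = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(inplace=True),
)
features3d = swap_2d_for_3d(features)

clip = torch.randn(2, 3, 8, 32, 32)   # [batch, channels, frames, height, width]
out = features3d(clip)
print(out.shape)
```

For the real AlexNet, the large first kernel and the strided pooling would also act on the frame dimension, so the clip needs enough frames to survive them, and the classifier's input size would have to be adjusted.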
If you are planning on using the pretrained parameters, you would have to think about a valid way to expand them along the new depth dimension.
Thanks @ptrblck ! Is there some approach you could suggest to use the pretrained weights from 2D? I am just changing the 2D kernels to 3D. So can we map the weights along the depth dimension as a starting point?
Thanks again for the help!
You could unsqueeze the pretrained kernels to create a new depth dimension and just repeat the values:
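import torch
import torch.nn as nn

conv = nn.Conv2d(6, 12, 3)
weight = conv.weight.detach()
print(weight.shape)
# torch.Size([12, 6, 3, 3])
# Add a depth dim at index 2 and repeat the 2D kernel along it (depth = kernel height)
weight = weight.unsqueeze(2).repeat(1, 1, weight.size(2), 1, 1)
print(weight.shape)
# torch.Size([12, 6, 3, 3, 3])
conv3d = nn.Conv3d(6, 12, 3)
conv3d.weight = nn.Parameter(weight)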
The choice of depth could be ambiguous if the 2D kernels are not square, since it is no longer obvious whether to use the kernel height or width, but this depends on the pretrained model of course.
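A slightly more general sketch of the same idea is shown below, assuming the depth is passed in explicitly (which sidesteps the non-square case) and that the repeated kernel is divided by the depth. The division is an addition not mentioned in this thread: it keeps the activations at their 2D scale, so a clip made of identical frames gives the same output as the 2D layer on a single frame. The helper name `inflate_conv2d` is hypothetical, and the sketch assumes `groups=1`, `dilation=1`, and numeric padding.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, depth: int) -> nn.Conv3d:
    """Build a Conv3d whose kernel is the 2D kernel repeated `depth` times."""
    kh, kw = conv2d.kernel_size
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(depth, kh, kw),
        stride=(1, *conv2d.stride),      # no striding over frames (assumption)
        padding=(0, *conv2d.padding),    # no padding over frames (assumption)
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # [out, in, kh, kw] -> [out, in, depth, kh, kw], rescaled by 1 / depth
        weight = conv2d.weight.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth
        conv3d.weight.copy_(weight)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

torch.manual_seed(0)
conv2d = nn.Conv2d(6, 12, kernel_size=(3, 5), padding=(1, 2))  # non-square kernel
conv3d = inflate_conv2d(conv2d, depth=3)
print(conv3d.weight.shape)

# Sanity check: a clip of 3 identical frames should reproduce the 2D output.
frame = torch.randn(1, 6, 16, 16)
clip = frame.unsqueeze(2).repeat(1, 1, 3, 1, 1)   # [1, 6, 3, 16, 16]
out2d = conv2d(frame)
out3d = conv3d(clip).squeeze(2)                   # depth collapses to 1
print(torch.allclose(out2d, out3d, atol=1e-5))
```

Without the division the inflated layer would produce outputs roughly `depth` times larger than the pretrained 2D layer, which may or may not matter once the model is fine-tuned.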