Hi!
I have a video classification task for which I extracted the frames and used AlexNet to classify them. I believe my implementation was misguided, since I was effectively just doing image classification frame by frame. Now I want to convert AlexNet to a 3D ConvNet so that I can combine it with an LSTM to do the classification.
Can you guys please suggest what changes I need to make to the existing Alexnet architecture?
Thanks!
I’m not sure how you would like to expand the shape of all parameters, which would be needed for a 3D model.
Would feeding each frame separately to the CNN and then forwarding these features to an RNN work instead?
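For illustration, such a per-frame pipeline could look roughly like this (a minimal sketch; the number of frames, the LSTM hidden size, and the 10 output classes are assumptions):

import torch
import torch.nn as nn
import torchvision.models as models

# pretrained 2D backbone used as a per-frame feature extractor
backbone = models.alexnet(pretrained=True)
backbone.classifier = nn.Identity()  # keep the flattened 9216-dim features

lstm = nn.LSTM(input_size=9216, hidden_size=256, batch_first=True)
head = nn.Linear(256, 10)  # 10 classes, assumed

x = torch.randn(2, 16, 3, 224, 224)  # [batch, frames, channels, height, width]
b, t = x.shape[:2]
feats = backbone(x.flatten(0, 1))  # [batch * frames, 9216]
out, _ = lstm(feats.view(b, t, -1))  # process the frame features sequentially
logits = head(out[:, -1])  # classify from the last time step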
I don’t think so. My advisor was suggesting a conv3d + lstm architecture.
OK, in that case you could try to swap all 2d layers for 3d ones, e.g. nn.Conv2d for nn.Conv3d.
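For the first AlexNet feature layers this could look like the following (a minimal sketch; the kernel sizes, strides, and paddings along the new depth dimension are assumptions you would have to tune):

import torch
import torch.nn as nn

# first AlexNet feature layers with the 2d modules swapped for their 3d counterparts
features3d = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(3, 11, 11), stride=(1, 4, 4), padding=(1, 2, 2)),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2)),
    nn.Conv3d(64, 192, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2)),
)

x = torch.randn(2, 3, 16, 224, 224)  # [batch, channels, depth/frames, height, width]
print(features3d(x).shape)
> torch.Size([2, 192, 16, 13, 13])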
This approach would reinitialize the model with random parameters.
If you are planning on using the pretrained parameters, you would have to think about a valid approach for expanding them to the depth dimension.
Thanks @ptrblck! Is there an approach you could suggest for using the pretrained 2D weights? I am just changing the 2D kernels to 3D, so could we map the weights along the depth dimension as a starting point?
Thanks again for the help!
You could unsqueeze the pretrained kernels to create a new depth dimension and just repeat the values:
import torch
import torch.nn as nn

conv = nn.Conv2d(6, 12, 3)
weight = conv.weight.detach()
print(weight.shape)
> torch.Size([12, 6, 3, 3])

# add a depth dimension and repeat the kernel values along it
depth = weight.size(2)  # repeat by the kernel size here
weight = weight.unsqueeze(2).repeat(1, 1, depth, 1, 1)
print(weight.shape)
> torch.Size([12, 6, 3, 3, 3])

conv3d = nn.Conv3d(6, 12, 3)
with torch.no_grad():
    conv3d.weight = nn.Parameter(weight)
This mapping could be ambiguous if the 2D kernels are not square, but that depends on the pretrained model, of course.
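As a side note, repeating the kernel along the new depth dimension scales the output by the repeat factor for inputs that are constant in time, so you might also want to divide the inflated weights by the depth (as done for "inflated" 3D convs, e.g. in I3D) to keep the activations roughly comparable:

# continuing the example above: rescale the inflated kernels so that a
# constant-in-time input yields roughly the same output as the 2D conv
with torch.no_grad():
    conv3d.weight /= conv3d.weight.size(2)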