Video frames as conv2d channels or 1 channel conv3d

Hi, I have a greyscale video input that I’m trying to do a classification problem on. There are ~62 frames in the video and each individual frame is 64x64 pixels. I was wondering if there is any difference between using a conv2d with 62 channels and using a conv3d with 62 depth and 1 channel. Conv3d seems to make more sense in context but since it’s just 1 channel I’m wondering if it actually makes any difference in practice?

Thanks in advance for any help

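Well, it does.
A conv2d with 62 input channels takes all the frames into account when computing each output, so the time axis is collapsed in the first layer.
Meanwhile, a conv3d with a temporal kernel size of 1 is like applying the same 2d convolution independently to each frame, so each output frame's features depend only on that frame. With a temporal kernel size larger than 1, each output frame depends only on the neighbouring frames that fall inside the kernel window.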
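
To make the difference concrete, here is a minimal PyTorch sketch. It is not from the original question: the batch size, the 8 output channels, the 3x3 spatial kernels, and the choice to perturb frame 10 are all assumptions for illustration. It perturbs a single input frame and checks which output frames change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical batch: 2 greyscale clips, 62 frames, 64x64 pixels each.
clip = torch.randn(2, 62, 64, 64)   # (batch, frames, H, W) for Conv2d
clip3d = clip.unsqueeze(1)          # (batch, 1 channel, frames, H, W) for Conv3d

# Option A: frames as Conv2d channels. Each output pixel sums over all 62 frames.
conv2d = nn.Conv2d(in_channels=62, out_channels=8, kernel_size=3, padding=1)

# Option B: Conv3d with temporal kernel size 1. Same 2d filter applied per frame.
conv3d_k1 = nn.Conv3d(1, 8, kernel_size=(1, 3, 3), padding=(0, 1, 1))

# Option C: Conv3d with temporal kernel size 3. Mixes each frame with its neighbours.
conv3d_k3 = nn.Conv3d(1, 8, kernel_size=(3, 3, 3), padding=1)

def changed_frames(layer, x, frame=10):
    """Return indices of output frames that change when one input frame is perturbed."""
    perturbed = x.clone()
    perturbed[:, :, frame] += 1.0
    diff = (layer(perturbed) - layer(x)).abs()
    # Reduce over batch, channel, height, width; keep the frame axis.
    per_frame = diff.amax(dim=(0, 1, 3, 4))
    return (per_frame > 1e-5).nonzero().flatten().tolist()

with torch.no_grad():
    out2d = conv2d(clip)
    out3d_k1 = conv3d_k1(clip3d)
    out3d_k3 = conv3d_k3(clip3d)
    changed_k1 = changed_frames(conv3d_k1, clip3d)
    changed_k3 = changed_frames(conv3d_k3, clip3d)

print(out2d.shape)     # torch.Size([2, 8, 64, 64]): no time axis left
print(out3d_k1.shape)  # torch.Size([2, 8, 62, 64, 64]): time axis preserved
print(changed_k1)      # [10]: only the perturbed frame's output changes
print(changed_k3)      # [9, 10, 11]: the perturbed frame and its neighbours
```

Note that the conv2d output has no frame axis at all, so any temporal ordering has to be learned through the channel weights of that first layer, while the conv3d variants keep the frame axis and share their weights across time.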
