Conv2d vs Conv3d


Suppose I am working with n RGB video frames and k x k convolution kernels. I can stack all the frames channel-wise and use PyTorch Conv2d with a 3n x k x k kernel, or I can simply use 3D convolutions with n x 3 x k x k kernels. So which should be used for the highest accuracy? Theoretically, the neural network should find either configuration comfortable, since the parameter count and receptive field remain the same in both cases. Or is Conv3d advantageous?
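To make the equivalence concrete, here is a small sketch (assuming, hypothetically, n = 8 frames, k = 3, and 64 output channels) showing that the two layers have identical parameter counts:

```python
import torch.nn as nn

# Hypothetical setup: n = 8 RGB frames, k = 3 spatial kernel, 64 output channels.
n, k = 8, 3

# Option A: stack frames channel-wise -> Conv2d over 3n input channels.
conv2d = nn.Conv2d(in_channels=3 * n, out_channels=64, kernel_size=k)

# Option B: keep a temporal dimension -> Conv3d whose kernel spans all n frames.
# Input layout is (batch, 3, n, H, W).
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(n, k, k))

# Both have 64 * 3 * n * k * k weights (plus 64 biases).
p2d = sum(p.numel() for p in conv2d.parameters())
p3d = sum(p.numel() for p in conv3d.parameters())
print(p2d, p3d)  # identical counts
```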

Thank you very much.

I can’t say for sure which is advantageous. It may depend on specific tasks.

Logically, if we talk about video-based tasks, Conv2d may not observe temporal context, and each frame will be processed independently (except that frames might interact during BatchNorm).
Conv3d lets the features from different frames interact and may learn temporal context. Empirically, Conv3d has been shown to perform better; you may refer to the ResNet3D or I3D model designs.

But as I said, the frames are concatenated channel-wise, so the 2D kernel looks at all the frames at once; hence all frames interact, just as with Conv3d.

I see. My bad for not understanding that.

There is still a disadvantage to Conv2d in cases like the one below:

Let’s say we have 6 frames with RGB channels.
With Conv3d, we can apply a conv kernel over every 3 consecutive frames to learn short-range temporal features, i.e., with in_channels=3 and kernel_size=(3, 5, 5), for example. In this way, there is a possibility to learn motion features in a hierarchical way.

With Conv2d, I am not sure we can emulate this. As you said, we can apply an 18 x 5 x 5 kernel, but a single kernel spanning all frames at once may not be as effective as learning hierarchically.

Thanks for the explanation.