Help understanding 3D Convolution

I have an image with M channels, height H, and width W, and I want to apply a channel-wise convolution, so I thought of using the Conv3d class. Currently, my image has shape (M, H, W).

But in the docs they specify that the input must be (N, Cin, D, H, W). What I know is that N is the minibatch size, H is the height and W is the width. But I am getting confused about Cin and D.

From what I understand, Cin is the number of channels of the image, but what does D mean? To do the convolution on my image should I pass it like (N, 1, M, H, W) or like (N, M, 1, H, W)?


3D convolutions are meant for data with an extra depth or temporal dimension, in short, a video.
In that case D is the number of images (frames) in the sequence, and Cin is the number of channels of each image.
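
To make the two layouts from the question concrete, here is a rough sketch of what each one does in Conv3d. The sizes and the 8 output channels are made up for illustration, not taken from the original post.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
N, M, H, W = 2, 4, 16, 16
image = torch.randn(N, M, H, W)  # a batch of M-channel images

# Layout (N, 1, M, H, W): the M channels become the depth axis D,
# so a 3x3x3 kernel slides across channels as well as height and width.
as_depth = image.unsqueeze(1)
conv_a = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
out_a = conv_a(as_depth)
print(out_a.shape)  # torch.Size([2, 8, 4, 16, 16])

# Layout (N, M, 1, H, W): the M channels are Cin and D is 1.
# With a (1, 3, 3) kernel this behaves like an ordinary Conv2d.
as_channels = image.unsqueeze(2)
conv_b = nn.Conv3d(in_channels=M, out_channels=8,
                   kernel_size=(1, 3, 3), padding=(0, 1, 1))
out_b = conv_b(as_channels)
print(out_b.shape)  # torch.Size([2, 8, 1, 16, 16])
```

So the first layout treats the channels like frames of a video, while the second just mixes all channels together at each spatial position.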


Thanks man! Seems like I’ll be using Conv2d then.
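
For anyone landing here later: if "channel-wise" means each channel gets its own filter with no mixing between channels (that is my assumption about the intent), one way to sketch it with Conv2d is the groups argument. The sizes below are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
M, H, W = 4, 16, 16
image = torch.randn(M, H, W)

# groups=M gives every input channel its own 3x3 kernel
# (a depthwise convolution), so channels are never mixed.
depthwise = nn.Conv2d(in_channels=M, out_channels=M,
                      kernel_size=3, padding=1, groups=M)

batch = image.unsqueeze(0)        # Conv2d expects (N, C, H, W)
out = depthwise(batch)

print(out.shape)                  # torch.Size([1, 4, 16, 16])
print(depthwise.weight.shape)     # torch.Size([4, 1, 3, 3])
```

If the channels should be mixed instead, dropping groups=M gives the usual Conv2d behaviour.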

@Manuel_Alejandro_Dia & @JuanFMontesinos
There is a great reference for understanding the different kinds of convolution operators (3D convolution, spatially separable convolution, depthwise separable convolution, etc.):