How to do convolution on tube tensors (3D conv)

I am trying to do convolution on video frames (a tube of frames).
I read the video frames and give each one the shape NxCinxDxHxW, where Cin = 3 (number of channels), H and W are the spatial dimensions (let's say they are equal), D is 1, and N is the batch size, let's say 1 for simplicity.

Then I concatenate them, so my final tensor has the size NxCxDxHxW, where D is the number of frames.
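The concatenation step described above could look roughly like the sketch below. The random tensors are only stand-ins for decoded frames, and the names `frames` and `clip` are illustrative, not taken from the original post.

```python
import torch

# Hypothetical stand-in for decoded video frames:
# 6 frames, each shaped N x C x 1 x H x W.
num_frames = 6
frames = [torch.randn(1, 3, 1, 10, 10) for _ in range(num_frames)]

# Concatenate along dim 2, the depth (frame) axis.
clip = torch.cat(frames, dim=2)
print(clip.shape)  # torch.Size([1, 3, 6, 10, 10])
```

If the frames were stored without the singleton D axis (N x C x H x W), `torch.stack(frames, dim=2)` would give the same layout.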

Now I want to do 3D convolution along the frames, i.e. I have an input of NxCxDxHxW and a kernel of CxDxKxK.

This is an example:

import torch
import torch.nn as nn

m = nn.Conv3d(3, 30, (6, 3, 3), stride=1, padding=(0, 1, 1))
input = torch.randn(1, 3, 6, 10, 10)
output = m(input)
print(output.shape)  # torch.Size([1, 30, 1, 10, 10])

I don't get the concept of padding along D. How does it happen?

Can you please tell me how I should do it?

Check out the Conv3D documentation -

According to that link the proper input size is (N, C, D, H, W)
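As a rough illustration of that layout, here is a minimal sketch with a kernel that is shorter than the clip along D, so it slides over the frames. The sizes are arbitrary choices for the example, not values from the documentation.

```python
import torch
import torch.nn as nn

# Kernel covers 3 frames at a time; no padding along D, padding of 1 along H and W.
conv = nn.Conv3d(in_channels=3, out_channels=30,
                 kernel_size=(3, 3, 3), stride=1, padding=(0, 1, 1))

x = torch.randn(1, 3, 6, 10, 10)  # (N, C, D, H, W)
y = conv(x)

# Weight layout is (out_channels, in_channels, kD, kH, kW).
print(conv.weight.shape)  # torch.Size([30, 3, 3, 3, 3])
# Output depth: D_out = D + 2*pad_D - kD + 1 = 6 + 0 - 3 + 1 = 4
print(y.shape)            # torch.Size([1, 30, 4, 10, 10])
```

With a kernel depth equal to the number of frames and no padding along D, the output depth collapses to 1.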


What does it mean to pad along the D dimension?

Where do you see that?

In the Conv3d documentation, we can pad three dimensions, like (1,1,1): the first value is the padding for D, and the last two are for H and W.
I don't understand what it means to pad along D.

It means adding extra frames (zero-filled with the default padding mode) to the start and end of the sequence.
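One way to check this is to compare the built-in padding along D with frames added by hand. The sketch below assumes the default `padding_mode='zeros'`; the layer names `conv_padded` and `conv_plain` are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 3, 6, 10, 10)  # (N, C, D, H, W)

# padding=(1, 1, 1): one extra frame on each side of D, plus a 1-pixel border in H and W.
conv_padded = nn.Conv3d(3, 30, kernel_size=3, padding=(1, 1, 1))
# Same layer, but with no padding along D.
conv_plain = nn.Conv3d(3, 30, kernel_size=3, padding=(0, 1, 1))
conv_plain.load_state_dict(conv_padded.state_dict())  # share weights and bias

# F.pad lists pairs starting from the last dim: (W, W, H, H, D_front, D_back).
# This adds one all-zero frame before and after the clip.
x_manual = F.pad(x, (0, 0, 0, 0, 1, 1))

out_padded = conv_padded(x)
out_manual = conv_plain(x_manual)

print(x_manual.shape)    # torch.Size([1, 3, 8, 10, 10])
print(out_padded.shape)  # torch.Size([1, 30, 6, 10, 10])
print(torch.allclose(out_padded, out_manual, atol=1e-5))
```

The two outputs should agree up to floating-point tolerance, which suggests that padding along D is equivalent to prepending and appending blank frames.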


In N,C,D,H,W… what is D?
And why is the channel dimension second, not last?

D corresponds to the “depth” of the volume, which is the additional dimension spanning the volume besides the height and width.

PyTorch uses the (user-facing) channels-first memory layout by default, so the channel dimension is placed in dim1 (also for 2D layers).
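If the frames arrive in a channels-last order, they can be rearranged before the convolution. This is a minimal sketch assuming a decoded clip shaped D x H x W x C with uint8 pixels; the random tensor is only a placeholder for real video data.

```python
import torch

# Hypothetical decoded clip in frames-first, channels-last order: D x H x W x C.
clip_dhwc = torch.randint(0, 256, (6, 10, 10, 3), dtype=torch.uint8)

# Move channels to the front (C, D, H, W), add a batch axis, and scale to [0, 1].
clip_ncdhw = clip_dhwc.permute(3, 0, 1, 2).unsqueeze(0).float() / 255.0
print(clip_ncdhw.shape)  # torch.Size([1, 3, 6, 10, 10])
```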


Thank you. In my specific case, the D (depth) dimension can be regarded as the time dimension. My input data is video.