How to do convolution on tube tensors (3D conv)

I am trying to do convolution on video frames (like a tube of frames).
So I'm reading the video frames and reshaping each one to NxCinxDxHxW, where Cin = 3 (the number of channels), H and W are the spatial dimensions (let's say they are equal), and D is 1. N is the batch size, let's say 1 for simplicity.

Then I concatenate them along D, so my final tensor has size NxCxDxHxW, where D is the number of frames.
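
A minimal sketch of that concatenation step, assuming each frame is already a tensor of shape (N, C, 1, H, W). The random tensors, the frame count, and the 10x10 size are placeholders, not real video data.

```python
import torch

# Placeholder sizes (assumptions for illustration only).
num_frames, height, width = 6, 10, 10

# Each frame has shape (N=1, C=3, D=1, H, W).
frames = [torch.randn(1, 3, 1, height, width) for _ in range(num_frames)]

# Concatenate along the depth axis (dim=2) so that D becomes the frame count.
clip = torch.cat(frames, dim=2)
print(clip.shape)  # torch.Size([1, 3, 6, 10, 10])
```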

Now I want to do a 3D convolution along the frames: I have an input of NxCxDxHxW and a kernel of CxDxKxK.

This is an example:

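import torch
import torch.nn as nn

m = nn.Conv3d(3, 30, (6, 3, 3), stride=1, padding=(0, 1, 1))
input = torch.randn(1, 3, 6, 10, 10)
output = m(input)
print(output.shape)  # torch.Size([1, 30, 1, 10, 10]): kernel depth 6 on 6 frames, no padding along D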

I don't get the concept of padding along D. How does it happen?

Can you please tell me how I should do it?

Check out the Conv3D documentation -

According to that link the proper input size is (N, C, D, H, W)
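
For reference, here is a small sketch of the output-size formula given in the Conv3d documentation, applied to one dimension at a time. The helper name `conv_out_size` is made up for illustration and is not part of PyTorch.

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one dimension (D, H or W) of a convolution."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

# Depth: 6 frames, kernel depth 6, no padding along D.
d_out = conv_out_size(6, kernel=6, padding=0)

# Depth again, but with one padded frame on each side.
d_out_padded = conv_out_size(6, kernel=6, padding=1)

# Height/width: 10 pixels, kernel 3, padding 1.
h_out = conv_out_size(10, kernel=3, padding=1)

print(d_out, d_out_padded, h_out)  # 1 3 10
```

So with a kernel depth equal to the number of frames, D collapses to 1 unless there is padding along D.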

What does it mean to pad along the D dimension?

Where do you see that?

In the Conv3d documentation, we can pad 3 dimensions, like (1,1,1): the first value is the padding for D, and the last two are for H and W.
I don't get what it means to pad along D.

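That means adding extra frames to the sequence: with the default zero padding, all-zero frames are added before the first frame and after the last one.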
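
One way to check this is to compare a Conv3d that pads along D with one that does not pad along D but is fed a clip with zero frames added by hand. This is only a sketch: the layer sizes, the random input, and the variable names are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 6, 10, 10)  # (N, C, D, H, W): a clip of 6 frames

# Padding of 1 along D, H and W.
conv_padded = nn.Conv3d(3, 30, kernel_size=3, stride=1, padding=(1, 1, 1))

# Same weights and bias, but no padding along D.
conv_no_d_pad = nn.Conv3d(3, 30, kernel_size=3, stride=1, padding=(0, 1, 1))
conv_no_d_pad.load_state_dict(conv_padded.state_dict())

# Add one all-zero frame before and one after the clip by hand.
# F.pad takes (left, right) pairs starting from the last dimension: W, H, D.
x_extra_frames = F.pad(x, (0, 0, 0, 0, 1, 1))

out_a = conv_padded(x)
out_b = conv_no_d_pad(x_extra_frames)

print(x_extra_frames.shape)  # torch.Size([1, 3, 8, 10, 10])
print(out_a.shape)           # torch.Size([1, 30, 6, 10, 10])
print(torch.allclose(out_a, out_b, atol=1e-5))  # both paths should agree
```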
