I am trying to run a convolution over video frames (like a tube of frames from a video).
So I'm reading the video frames and giving each one the shape NxCinxDxHxW, where Cin = 3 (number of channels), H and W are the spatial dimensions (let's say they are equal), and D is 1. N is the batch size; let's say 1 for simplicity.

Then I concatenate them along D, so my final tensor has the size NxCxDxHxW, where D is the number of frames.
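
To make the setup concrete, here is a minimal sketch of what I mean, assuming PyTorch. The sizes (8 frames of 32x32) and the random data standing in for real frames are made up for illustration only:

```python
import torch

# Hypothetical sizes, chosen only for illustration.
num_frames, channels, height, width = 8, 3, 32, 32

# Each frame is N x C x D x H x W with N = 1 and D = 1.
# Random data stands in for real decoded video frames.
frames = [torch.randn(1, channels, 1, height, width) for _ in range(num_frames)]

# Concatenate along dim=2 (the D axis), so D becomes the number of frames.
clip = torch.cat(frames, dim=2)
print(clip.shape)  # torch.Size([1, 3, 8, 32, 32])
```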

Now I want to do a 3D convolution that also runs along the frames: I have an input of NxCxDxHxW and a kernel of CxDxKxK.
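
As a rough sketch of the convolution itself, assuming `torch.nn.Conv3d`: the 16 output channels and the 3x3x3 kernel are arbitrary choices of mine, not something fixed by my problem. Note that in PyTorch the weight tensor also carries an output-channel dimension in front, so its shape is (C_out, C_in, kD, kH, kW):

```python
import torch
import torch.nn as nn

# Stand-in clip: N x C x D x H x W = 1 x 3 x 8 x 32 x 32 (illustrative sizes).
clip = torch.randn(1, 3, 8, 32, 32)

# kernel_size is given as (kD, kH, kW); out_channels=16 is an arbitrary choice.
conv = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=(3, 3, 3))

# Weight layout: (C_out, C_in, kD, kH, kW).
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3, 3])

# With no padding and stride 1, every convolved axis shrinks by (kernel - 1).
out = conv(clip)
print(out.shape)  # torch.Size([1, 16, 6, 30, 30])
```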

In the conv3d documentation, the padding can be given for three dimensions, like (1,1,1): the first value is the padding for D, and the last two are for H and W.
I don't understand what it means to pad along D.
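
My working guess is that padding along D means adding all-zero frames before the first frame and after the last one (with the default zero padding mode). A small sketch to check that guess, assuming `torch.nn.functional.conv3d` and `torch.nn.functional.pad`, with made-up sizes and random weights:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
clip = torch.randn(1, 3, 8, 32, 32)    # N x C x D x H x W (illustrative sizes)
weight = torch.randn(16, 3, 3, 3, 3)   # (C_out, C_in, kD, kH, kW), random for the test

# Padding only H and W: D shrinks from 8 to 6 with a depth-3 kernel.
no_d_pad = F.conv3d(clip, weight, padding=(0, 1, 1))
print(no_d_pad.shape)  # torch.Size([1, 16, 6, 32, 32])

# Padding D as well: D stays at 8.
d_pad = F.conv3d(clip, weight, padding=(1, 1, 1))
print(d_pad.shape)  # torch.Size([1, 16, 8, 32, 32])

# Add one zero frame at each end of D by hand.
# F.pad lists pads from the last dim backwards:
# (W_left, W_right, H_top, H_bottom, D_front, D_back).
manual = F.pad(clip, (0, 0, 0, 0, 1, 1))
manual_out = F.conv3d(manual, weight, padding=(0, 1, 1))

# If the guess is right, both routes give the same result.
same = torch.allclose(d_pad, manual_out, atol=1e-4)
print(same)
```

If the two results match, then the first entry of the padding tuple just controls how many zero frames get added at each end of the clip, which would keep the number of output frames equal to the number of input frames for a depth-3 kernel.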