Conv3d for image time-series forecasting


I have an image time-series forecasting problem that I think can be solved with Conv3d. However, I am confused about the input shape.

In the documentation I saw that the input shape is (N, C, D, H, W).

  • For my case H and W are both 256

  • C is the number of channels

  • N is the batch size (number of samples per batch)

However, I am confused about the parameter D. What does it stand for? And how should I define it in my case?

Thanks in advance!

D refers to the “depth” of the volume you are using, since nn.Conv3d expects volumes as inputs instead of 2D planes/images. The filter kernel of the Conv3d layer also has an additional depth dimension, and the convolution is applied over all 3 dimensions (i.e. the filter moves along depth, height, and width).
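To make the extra kernel dimension concrete, here is a minimal sketch. The channel counts, kernel size, and input size are arbitrary values picked for illustration, not anything from your setup.

```python
import torch
import torch.nn as nn

# Illustrative sizes only: 3 input channels, 8 output channels, 3x3x3 kernel.
conv = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

# The weight has a depth axis too: (out_channels, in_channels, kD, kH, kW).
weight_shape = tuple(conv.weight.shape)

# Input layout is (N, C, D, H, W): 2 samples, 3 channels, depth 15, 64x64 planes.
x = torch.randn(2, 3, 15, 64, 64)
out = conv(x)

# With kernel_size=3, padding=1, and the default stride of 1, D, H, and W are preserved.
out_shape = tuple(out.shape)
print(weight_shape, out_shape)
```

The kernel slides along D as well as H and W, which is why the weight carries three spatial sizes rather than two.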

So in 3D convolution, is D always 3? Or what should I consider when deciding what D should be?

No, the depth can be any value, just like the height and width dimensions.
Think of the depth as, for example, a stack of medical images such as CT scans.
The height and width define the “image” dimensions of each CT slice, while the depth defines the number of slices. In this case the volume contains “voxels” instead of pixels, since you are now using a volume to represent the scan.
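As a rough sketch of "any depth works": the same layer can be applied to stacks with different numbers of slices. The single-channel input, the 32x32 slice size, and the depths below are made-up values for illustration.

```python
import torch
import torch.nn as nn

# One grayscale channel, like a CT scan; no padding and the default stride of 1.
conv = nn.Conv3d(in_channels=1, out_channels=2, kernel_size=3)

out_depths = {}
for depth in (5, 15, 40):
    scan = torch.randn(1, 1, depth, 32, 32)  # (N, C, D, H, W)
    # Without padding, each spatial size shrinks by kernel_size - 1 = 2.
    out_depths[depth] = conv(scan).shape[2]

print(out_depths)
```

The layer itself does not fix D; only the kernel's depth size has to fit inside the (padded) input depth.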

Ahaaa, okay, so if I have 15 images the depth is 15, and if they are RGB images the number of channels is 3?

Yes, each “volume” (or stack of images) can have multiple channels. For a “color volume” the channel dimension would be 3.
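For the case discussed here (15 RGB images of 256x256), the input could be assembled roughly as follows. This is only a sketch: the random tensors stand in for real frames, and the layer's output channel count is an arbitrary choice.

```python
import torch
import torch.nn as nn

# Placeholder data: 15 RGB frames, each stored as (C, H, W) = (3, 256, 256).
frames = [torch.rand(3, 256, 256) for _ in range(15)]

# Stack along a new axis 1 so time becomes the depth: (C, D, H, W).
volume = torch.stack(frames, dim=1)

# Add the batch axis to get (N, C, D, H, W) = (1, 3, 15, 256, 256).
batch = volume.unsqueeze(0)

# Arbitrary example layer; padding=1 keeps D, H, and W unchanged.
conv = nn.Conv3d(in_channels=3, out_channels=4, kernel_size=3, padding=1)
out = conv(batch)
print(tuple(batch.shape), tuple(out.shape))
```

If your frames are stored as (D, C, H, W) instead, swapping the first two axes with `permute(1, 0, 2, 3)` gives the same (C, D, H, W) layout before adding the batch axis.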