What should be the input shape for 3D CNN on a sequence of images?

The Conv3d — PyTorch 1.7.1 documentation describes the input to a 3D convolution as (N, Cin, D, H, W). Imagine I have a sequence of images that I want to pass to a 3D CNN. Am I right that (a minimal shape check is sketched below the list):

  1. N → number of sequences (the mini-batch size)
  2. Cin → number of channels (3 for RGB)
  3. D → number of images in a sequence
  4. H → height of one image in the sequence
  5. W → width of one image in the sequence

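As a quick check of that mapping, here is a minimal sketch using the image size from below; the batch size of 2 and the 8 output channels are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# Dummy input following the mapping above:
# N=2 sequences, Cin=3 (RGB), D=5 images per sequence, H=396, W=247
x = torch.randn(2, 3, 5, 396, 247)

conv = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
out = conv(x)
print(out.shape)  # torch.Size([2, 8, 5, 396, 247])
```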
The reason I am asking is that when I stack image tensors with a = torch.stack([img1, img2, img3, img4, img5]), I get a tensor a of shape torch.Size([5, 3, 396, 247]). Is it compulsory to permute my tensor to torch.Size([3, 5, 396, 247]) so that the number of channels comes first, or does it not matter inside the DataLoader?
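For reference, moving the channels to the front is a permute rather than a reshape (a reshape would scramble the pixel data). A sketch, assuming img1..img5 are RGB tensors of shape (3, 396, 247):

```python
import torch

# Stand-ins for img1..img5, each assumed to be (C, H, W) = (3, 396, 247)
imgs = [torch.randn(3, 396, 247) for _ in range(5)]

a = torch.stack(imgs)       # (D, C, H, W) -> torch.Size([5, 3, 396, 247])
a = a.permute(1, 0, 2, 3)   # (C, D, H, W) -> torch.Size([3, 5, 396, 247])
print(a.shape)
```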

Note that the DataLoader would automatically add one more dimension, which would correspond to N.
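To illustrate, a sketch where each dataset sample is one sequence shaped (C, D, H, W) and the DataLoader collates a batch shaped (N, C, D, H, W); the SequenceDataset class, its length, and the batch size are made up for this example:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SequenceDataset(Dataset):
    """Hypothetical dataset: each sample is one sequence of 5 RGB frames."""
    def __len__(self):
        return 16

    def __getitem__(self, idx):
        # One sample shaped (C, D, H, W) = (3, 5, 396, 247)
        return torch.randn(3, 5, 396, 247)

loader = DataLoader(SequenceDataset(), batch_size=4)
batch = next(iter(loader))
print(batch.shape)  # torch.Size([4, 3, 5, 396, 247]); N is added by the DataLoader
```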

You could use the depth dimension to stack a sequence of images. Note, however, that the kernels in each conv layer also have a depth dimension and would thus convolve through it as well (unless you set the kernel size to 1 for the depth dimension).
nn.Conv3d is often used on a “stack” of images, e.g. medical CT scans, where the slices form the depth dimension.
Depending on your use case, the same approach might also work for your “sequences”.
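For example (a sketch; the layer sizes are arbitrary), a kernel_size of (1, 3, 3) processes each frame independently, while (3, 3, 3) also mixes information across neighboring frames in the depth dimension:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 3, 5, 396, 247)  # (N, C, D, H, W)

# Kernel depth of 3: convolves through the depth (temporal) dimension as well
temporal = nn.Conv3d(3, 8, kernel_size=(3, 3, 3), padding=(1, 1, 1))

# Kernel depth of 1: each frame in the sequence is convolved independently
per_frame = nn.Conv3d(3, 8, kernel_size=(1, 3, 3), padding=(0, 1, 1))

print(temporal(x).shape)   # torch.Size([2, 8, 5, 396, 247])
print(per_frame(x).shape)  # torch.Size([2, 8, 5, 396, 247])
```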
