Doubt about using conv2d or conv3d


I am unsure whether to use conv2d or conv3d for my problem. I have an array of shape (M, M, N): N images, each made of M x M pixels.

My question is:

  • Should I use a 2D conv net where the N images are the channels (i.e. input shape (batch size, N, M, M)), or

  • a 3D conv net that starts with one channel on a 3D image (i.e. input shape (batch size, 1, M, M, N))?

The N images may have shared statistics. I have tried both, and the 3D conv net seems to give better results, but I am not sure how to interpret that. It may simply be due to a hyperparameter issue.


The difference between nn.Conv2d and nn.Conv3d is how the additional N dimension is handled in your use case.
The nn.Conv2d layer would interpret N as the channel dimension, so in the default setup each kernel would use all N channels at once. The sliding window would be applied only in the spatial MxM dimensions.
The nn.Conv3d layer, on the other hand, would use a single input channel and apply a “sliding cube” across the NxMxM dimensions, so each kernel would only see a local neighborhood along N as well as along the spatial dimensions.
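To make the two layouts concrete, here is a minimal sketch of both options. The sizes (batch_size=4, M=32, N=10), the 8 output channels, and the 3x3 / 3x3x3 kernels with padding=1 are assumptions picked for illustration only, not values from your setup. Note that the 3D input is arranged here as (batch size, 1, N, M, M) so that N becomes the depth dimension.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
batch_size, M, N = 4, 32, 10

# Data as described in the question: per sample, N images of M x M pixels.
x = torch.randn(batch_size, M, M, N)

# Option 1: nn.Conv2d, N is the channel dimension -> (batch, N, M, M).
x2d = x.permute(0, 3, 1, 2)
conv2d = nn.Conv2d(in_channels=N, out_channels=8, kernel_size=3, padding=1)
out2d = conv2d(x2d)

# Option 2: nn.Conv3d, one input channel, N is the depth -> (batch, 1, N, M, M).
x3d = x.permute(0, 3, 1, 2).unsqueeze(1)
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
out3d = conv3d(x3d)

# Each 2D kernel spans all N channels; each 3D kernel spans only 3 steps along N.
print(conv2d.weight.shape)  # torch.Size([8, 10, 3, 3])
print(conv3d.weight.shape)  # torch.Size([8, 1, 3, 3, 3])
print(out2d.shape)          # torch.Size([4, 8, 32, 32])
print(out3d.shape)          # torch.Size([4, 8, 10, 32, 32])
```

The weight shapes show the difference: the 2D kernel mixes all N images in a single step and its output no longer has an N axis, while the 3D kernel is shared along N and its output keeps that axis.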

I’m not familiar with your use case and don’t know which approach would make more sense. Since the 3D layer seems to give better results, you could stick to it.