Conv2D vs Conv3D on volumes (with and without channels)

I’m trying to wrap my head around the Conv2D and Conv3D operations for volumes. I have two datasets containing medical volumes: one is 4D (e.g. CT data with some time resolution), and the other is 3D (independent CT volumes). Both datasets have only 1 channel, so no color channels. An example shape from the 4D dataset would be (20, 155, 300, 300) (i.e. 20 time frames for one patient), and from the 3D dataset (155, 300, 300); we’re excluding the batch dimension for now.

Here is how I want to use the Conv2D and Conv3D operations. For the 3D dataset, I would like to use a Conv2D with the slices in the channel dimension. The output would then take all slices into consideration, since each filter has the same number of channels as the input data. For the 4D dataset, I would like to use Conv3D to take into account e.g. two consecutive time frames, so the input shape would be (2, 155, 300, 300). To me, this makes sense.
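To make the shapes concrete, here is a small PyTorch sketch of both setups. The sizes are scaled down from (155, 300, 300) to (16, 64, 64) just to keep it cheap to run, and the number of output channels (8) is an arbitrary choice for illustration:

```python
import torch
import torch.nn as nn

# --- Conv2D on the 3D dataset: slices go into the channel dimension ---
# Scaled-down stand-in for one (155, 300, 300) volume, plus a batch dim.
vol = torch.randn(1, 16, 64, 64)                  # (batch, channels=slices, H, W)
conv2d = nn.Conv2d(in_channels=16, out_channels=8, kernel_size=3, padding=1)
out2d = conv2d(vol)
print(out2d.shape)                                # torch.Size([1, 8, 64, 64])

# --- Conv3D on the 4D dataset: two consecutive time frames as channels ---
# Scaled-down stand-in for a (2, 155, 300, 300) pair of time frames.
pair = torch.randn(1, 2, 16, 64, 64)              # (batch, channels=time, D, H, W)
conv3d = nn.Conv3d(in_channels=2, out_channels=8, kernel_size=3, padding=1)
out3d = conv3d(pair)
print(out3d.shape)                                # torch.Size([1, 8, 16, 64, 64])
```

So in the Conv2D case the depth dimension is consumed into the channels, while in the Conv3D case it survives as a spatial dimension of the output.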

Recently, however, I stumbled upon an answer discussing a similar question: whether to use Conv2D or Conv3D for MRI volumes. The answer (from ptrblck) was this:

" I’m not sure what would work best for your use case, but let’s have a look at the differences between both approaches.*

I assume your MRI data has a spatial size of 256x256 and contains 125 slices.
If you’re using nn.Conv2d I would suggest to use the slices as the “channels”.
This would mean that each kernel in your conv layer will have the defined spatial size, e.g. 3, and will use all channels. The kernel shape would be [nb_kernels, 125, 3, 3]. The output will thus be calculated by using a dot product of the small 3x3 window and all slices.

On the other hand, if you are using nn.Conv3d, you could permute the slices to the “depth” dimension and add a single dimension for the channels. This would mean your kernel is now a volume with the shape e.g. [3x3x3], such that 3 neighboring slices will be used to produce the output for the current position.

*What would you like your model to learn in the MRI images? I think in a segmentation case the second approach could work better, but that’s just my guess."

I feel like I understand the answer regarding the Conv2D part, and I have written something similar above. However, the 3D convolution on a single volume is what confuses me. I get the idea of moving the slices to the depth dimension and then adding a dimension for the channels, so the shape of the 3D data would become (1, 155, 300, 300), where the 1 is the new channel dimension. What I’m having trouble understanding is why we would do this. Given the answer above, it seems that this way we would not use all the slices simultaneously, as we did with Conv2D. Instead we would use, say, only 3 slices for each output (depending on kernel size). Is that the difference here, or is there something else I’m missing?
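For comparison, this is what the single-channel Conv3D setup from the quoted answer looks like in code (same scaled-down, made-up sizes as before: 16 slices instead of 155). The kernel shapes show the difference between the two approaches: the Conv2D filter has a separate 3x3 weight for every slice, while the 3x3x3 Conv3D filter only covers 3 neighbouring slices at each position and is reused as it slides along the depth axis:

```python
import torch
import torch.nn as nn

# Scaled-down stand-in for the (1, 155, 300, 300) volume: 16 slices in depth.
vol = torch.randn(1, 1, 16, 64, 64)               # (batch, channel=1, D, H, W)

conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
out = conv3d(vol)
print(out.shape)                                  # torch.Size([1, 8, 16, 64, 64])

# Each 3D filter covers only 3 neighbouring slices at a time and is shared
# across depth, so it has 1 * 3 * 3 * 3 = 27 weights:
print(conv3d.weight.shape)                        # torch.Size([8, 1, 3, 3, 3])

# The Conv2D-with-slices-as-channels filter instead has 16 * 3 * 3 = 144
# weights, one 3x3 map per slice, and mixes all slices at once:
conv2d = nn.Conv2d(in_channels=16, out_channels=8, kernel_size=3, padding=1)
print(conv2d.weight.shape)                        # torch.Size([8, 16, 3, 3])
```

Note also that the Conv3D output is still a volume (depth is preserved), whereas the Conv2D output has collapsed the slices into feature channels.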

I want to make clear that this is about volumes where we don’t have any color channels. For a volume with color channels, where each slice is RGB (e.g. a shape of (3, 155, 300, 300)), a Conv3D operation makes sense to me (I think).
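For that RGB case, the sketch would just move the color channels into in_channels (again with made-up, scaled-down sizes):

```python
import torch
import torch.nn as nn

# Scaled-down stand-in for a (3, 155, 300, 300) RGB volume, plus a batch dim.
rgb_vol = torch.randn(1, 3, 16, 64, 64)           # (batch, channels=RGB, D, H, W)
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
out = conv3d(rgb_vol)
print(out.shape)                                  # torch.Size([1, 8, 16, 64, 64])
```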

My main goal is doing segmentation btw.

So the main question is: Is there any point doing a Conv3D operation on a single volume with only 1 channel?

Hopefully it’s clear what I mean, let me know otherwise!