Conv2D vs Conv3D on volumes (with and without channels)

I’m trying to wrap my head around the Conv2D and Conv3D operations for volumes. I basically have two datasets containing medical volumes. One dataset is 4D, e.g. CT data with some time resolution, and the other dataset is 3D, i.e. independent CT volumes. For both of these datasets we have only 1 channel, so no color channels. An example shape from the 4D dataset would be (20,155,300,300) (i.e. 20 time frames for one patient), and from the 3D dataset (155,300,300); we’re excluding the batch dimension for now.

The way I want to use the Conv2D and Conv3D operations is as follows. For the 3D dataset, I would like to use a Conv2D with the slices in the channel dimension. The output would then take all slices into consideration, since each filter has the same number of channels as the input data. For the 4D dataset, I would like to use Conv3D to take into account e.g. two consecutive time frames, so the input shape would be (2,155,300,300). To me, this makes sense.
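To make the shapes concrete, here’s roughly what I have in mind (the out_channels is made up, and I’ve shrunk H/W in the 3D case just to keep the example light; I’m assuming the two time frames go into the channel dimension of Conv3d):

```python
import torch
import torch.nn as nn

# 3D dataset: one volume (155, 300, 300) with the slices used as channels.
vol = torch.randn(1, 155, 300, 300)                 # (batch, channels=slices, H, W)
conv2d = nn.Conv2d(in_channels=155, out_channels=32, kernel_size=3, padding=1)
print(conv2d(vol).shape)                            # torch.Size([1, 32, 300, 300])

# 4D dataset: two consecutive time frames as channels, slices as depth.
# Smaller H/W here just to keep the example fast; the real volumes are 300x300.
frames = torch.randn(1, 2, 155, 64, 64)             # (batch, channels=time, D, H, W)
conv3d = nn.Conv3d(in_channels=2, out_channels=32, kernel_size=3, padding=1)
print(conv3d(frames).shape)                         # torch.Size([1, 32, 155, 64, 64])
```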

However, I recently stumbled upon this answer with a similar discussion. The question was asking something similar: whether to use Conv2D or Conv3D for MRI volumes. The answer (from ptrblck) was this:

" I’m not sure what would work best for your use case, but let’s have a look at the differences between both approaches.*

I assume your MRI data has a spatial size of 256x256 and contains 125 slices.
If you’re using nn.Conv2d I would suggest to use the slices as the “channels”.
This would mean that each kernel in your conv layer will have the defined spatial size, e.g. 3, and will use all channels. The kernel shape would be [nb_kernels, 125, 3, 3]. The output will thus be calculated by using a dot product of the small 3x3 window and all slices.

On the other hand, if you are using nn.Conv3d, you could permute the slices to the “depth” dimension and add a single dimension for the channels. This would mean your kernel is now a volume with the shape e.g. [3x3x3], such that 3 neighboring slices will be used to produce the output for the current position.

*What would you like your model to learn in the MRI images? I think in a segmentation case the second approach could work better, but that’s just my guess."

I feel like I understand the answer regarding the Conv2D part, and I think I described something similar above. However, the 3D convolution using a single volume is what confuses me. I get the idea of moving the slices to the depth dimension and then adding a dimension for the channels. So the shape for the 3D data would thus be (1,155,300,300), where the 1 here is the new channel dimension. However, I’m having trouble understanding why we would do this. Given the answer above, it seems that doing it this way, we would not consider all the slices simultaneously as we did with Conv2D. Instead we would use, say, only 3 slices for each output (depending on the kernel size). Is this the difference here, or is there something else I’m missing?
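To check that I’m reading the answer correctly, here is how I picture the two kernel shapes (out_channels = 16 is made up):

```python
import torch.nn as nn

# Option A: slices as channels -> every kernel mixes all 155 slices at once.
conv2d = nn.Conv2d(in_channels=155, out_channels=16, kernel_size=3)
print(conv2d.weight.shape)   # torch.Size([16, 155, 3, 3])

# Option B: slices as depth, one dummy channel -> each kernel only covers
# 3 neighbouring slices and slides along the depth dimension.
conv3d = nn.Conv3d(in_channels=1, out_channels=16, kernel_size=3)
print(conv3d.weight.shape)   # torch.Size([16, 1, 3, 3, 3])
```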

I want to make clear that this is regarding volumes where we don’t have any color channels. For a volume with color channels, where each slice is RGB (e.g. shape (3,155,300,300)), a Conv3D operation makes sense to me (I think).

My main goal is doing segmentation btw.

So the main question is: Is there any point doing a Conv3D operation on a single volume with only 1 channel?

Hopefully it’s clear what I mean, let me know otherwise!

I’ve been trying to figure out a similar question myself, and your question actually broke it open for me. Here’s what I have come to understand:

When using Conv2D, the expectation is that all of the input channels will be included in each convolution. For your 3D dataset, your in_channels would necessarily be 155. So your kernels will have shape (155, kernel_size[0], kernel_size[1]), and you will have out_channels of these kernels to produce that many output maps. If you consider just one output map, the pixel at index [0, 0, 0] is the result of the pixels within the window centered at location [1, 1] across all 155 slices. This is why Conv2D is named as such: the kernel cannot move in the third dimension. So it does in fact compute over a 3D block, but it can’t actually move in the third dimension because it’s as deep as the input by default.

Conv3D, on the other hand, is not limited to the same depth as the input. In other words, the kernel can move in the third dimension (though it does not have to). Suppose you suspect there are meaningful spatial relationships in the third dimension that span a depth of 3 slices, but you are skeptical that they extend beyond that. Then you wouldn’t want to use Conv2D, because every kernel will span the entire depth. Using Conv3D, you can define the kernel so that it only centers over 3 slices at a time. Thus, your output at index [0, 0, 0] will come from the kernel when it is centered at [1, 1, 1], over only the first 3 slices. The kernel will drag across the entire 3-slice volume, then shift down one slice and repeat, building up a new image volume where each layer is some representation of the voxels in that window, but not of the entire stack of slices in that (x, y) window.
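A quick shape check makes this concrete (small made-up sizes so it runs fast; note how the depth dimension disappears in the Conv2D output but survives in the Conv3D output):

```python
import torch
import torch.nn as nn

x2d = torch.randn(1, 155, 64, 64)        # (N, C=slices, H, W)
x3d = torch.randn(1, 1, 155, 64, 64)     # (N, C=1, D=slices, H, W)

out2d = nn.Conv2d(155, 8, kernel_size=3)(x2d)
out3d = nn.Conv3d(1, 8, kernel_size=3)(x3d)

# Conv2d: each output pixel already mixed all 155 slices, so there is no
# depth dimension left for the kernel to move along.
print(out2d.shape)   # torch.Size([1, 8, 62, 62])

# Conv3d: the kernel slid along depth; each output voxel saw only 3 slices.
print(out3d.shape)   # torch.Size([1, 8, 153, 62, 62])
```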

To really drive home the difference for me, it was helpful to think about your 4D set. Suppose you thought the 4D set had meaningful temporal relationships across 3 time steps. Then Conv3D is perfect, with in_channels = 155 and kernel_size = (3, 3, 3). Maybe, on the other hand, you suspect these relationships span all of time, but the depth relationships are limited to within 3 slices. Again, Conv3D is the tool for you, but now with in_channels = 20. I should add, in the first case you’d need to permute the dimensions so that the slices end up in the channel dimension and time in the depth dimension.
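Here’s a sketch of those two cases (made-up out_channels and smaller H/W to keep it light):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 20, 155, 64, 64)   # (N, time, depth, H, W)

# Case 1: relationships limited to 3 time steps -> slices become channels,
# time becomes the conv "depth". Needs a permute to (N, 155, 20, H, W).
conv_time = nn.Conv3d(in_channels=155, out_channels=8, kernel_size=3, padding=1)
out1 = conv_time(x.permute(0, 2, 1, 3, 4))
print(out1.shape)   # torch.Size([1, 8, 20, 64, 64]) -> kernel slides over time

# Case 2: relationships limited to 3 slices -> time frames become channels,
# slice depth is the conv "depth". The data is already (N, 20, 155, H, W).
conv_depth = nn.Conv3d(in_channels=20, out_channels=8, kernel_size=3, padding=1)
out2 = conv_depth(x)
print(out2.shape)   # torch.Size([1, 8, 155, 64, 64]) -> kernel slides over depth
```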

What if you think both time and depth relationships are limited to a certain spacetime within your volume? You need Conv4D (which is not included in PyTorch yet). Recognizing why you would need that – so you could drag a kernel through the depth of your voxels across a few (but not all) timesteps at a time – is what really made it clear for me how these work differently.

Using Conv2D on a 3D volume (with depth as channels) is similar to using Conv1D on a 2D image with one of the spatial dimensions as the channel dimension. Just as a Conv2D applied to that image does not consider all of its rows simultaneously but slides over them, Conv3D does not consider all depth slices simultaneously.
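A small sketch of that analogy (made-up sizes):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 128, 128)          # one grayscale image, (N, H, W)

# Height rows as "channels": each kernel mixes all 128 rows at every x position.
conv1d = nn.Conv1d(in_channels=128, out_channels=8, kernel_size=3)
print(conv1d(img).shape)                # torch.Size([1, 8, 126])

# Proper 2D conv: the 3x3 kernel slides over height as well as width.
conv2d = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)
print(conv2d(img.unsqueeze(1)).shape)   # torch.Size([1, 8, 126, 126])
```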

A convolutional layer is very similar to a linear layer, i.e. a perceptron. In fact, if the kernel size is 1x1, such a convolutional layer is equivalent to applying one linear layer to the vector of channel values at each pixel of the input image. Crucially, it does not care about the ordering of the input channels. For example, you could take your 3D CT scan dataset and randomly permute all of the scans along the depth dimension; as long as you apply the same permutation to all scans, you can still train a Conv2D perfectly fine, even though to you (and to a Conv3D) such a volume would look jumbled if you sliced it along the width or height dimension. Conv2D doesn’t see the spatial structure in the depth dimension, so Conv3D is likely going to be more effective. Especially if you do segmentation, passing the entire volume to Conv2D as channels is extremely unlikely to work and is similar to trying to do segmentation with a multi-layer perceptron.
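For example, here is a minimal sketch of the 1x1-conv / linear-layer equivalence (made-up sizes and layer widths):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 155, 32, 32)                       # (N, C, H, W)

conv = nn.Conv2d(155, 64, kernel_size=1)

# Build an equivalent Linear layer from the same weights.
lin = nn.Linear(155, 64)
lin.weight.data = conv.weight.data.view(64, 155)
lin.bias.data = conv.bias.data

# Apply the Linear layer to the channel vector of every pixel.
out_lin = lin(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(torch.allclose(conv(x), out_lin, atol=1e-5))    # True
```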

That being said, you can pass individual slices to a 2D network and it may work fine. In fact, passing a few neighboring slices into the channel dimension will still help, because the network can work out relationships between the brightness values of those slices. But if you pass the entire depth dimension into the channels, working that out becomes very hard.
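For example, a minimal sketch of the “few neighbouring slices as channels” idea (slice index and layer sizes made up):

```python
import torch
import torch.nn as nn

volume = torch.randn(155, 300, 300)      # one CT volume, (depth, H, W)

# "2.5D" input: the slice of interest plus one neighbour on each side
# becomes a 3-channel image for an ordinary 2D network.
i = 40                                    # arbitrary slice index
x = volume[i - 1 : i + 2].unsqueeze(0)    # (1, 3, 300, 300)

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(conv(x).shape)                      # torch.Size([1, 16, 300, 300])
```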