Feeding 3D volumes to Conv3D

Hi there, I’m trying to feed 3D volumes through a NN which has a Conv3D layer.
For simplicity I have:

self.conv1 = nn.Conv3d(125, 2, 3)

and in the forward:

return self.conv1(x)

My volume is 125x256x256. When I try to feed a random tensor:


I get:

RuntimeError: Expected 5-dimensional input for 5-dimensional weight [2, 125, 3, 3, 3], but got input of size [1, 125, 256, 256] instead

I do not understand why it is asking for a 5-dimensional input.

If your input has the shape [batch_size, channels, height, width], you should use a nn.Conv2d layer.
For nn.Conv3d you need an additional dimension, e.g. time or slices through an MRI.
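
A minimal sketch of the shape requirement, assuming a single-channel volume (the variable names and the reduced 32x32 spatial size are illustrative and not from the original post). nn.Conv3d expects [batch_size, channels, depth, height, width], so the volume needs an explicit channel dimension, and in_channels becomes 1 rather than 125:

```python
import torch
import torch.nn as nn

# One single-channel volume with 125 slices. The spatial size is reduced
# to 32x32 (instead of 256x256) only to keep the example light.
volume = torch.randn(1, 125, 32, 32)   # [batch, slices, height, width]

# Add a channel dimension so the slices become the depth axis.
x = volume.unsqueeze(1)                # [batch, channels=1, depth, height, width]

# in_channels is 1 here, not 125: the slices are depth, not channels.
conv3d = nn.Conv3d(in_channels=1, out_channels=2, kernel_size=3)
out = conv3d(x)
print(out.shape)  # torch.Size([1, 2, 123, 30, 30])
```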

I’m using MRI, but the volume is 256x256x125. Should I use conv2d or conv3d?

I’m not sure what would work best for your use case, but let’s have a look at the differences between both approaches.

I assume your MRI data has a spatial size of 256x256 and contains 125 slices.
If you’re using nn.Conv2d I would suggest using the slices as the “channels”.
This would mean that each kernel in your conv layer will have the defined spatial size, e.g. 3, and will use all channels. The kernel shape would be [nb_kernels, 125, 3, 3]. The output will thus be calculated by using a dot product of the small 3x3 window and all slices.
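
As a rough sketch of this first approach (nb_kernels = 8 and the reduced 64x64 spatial size are arbitrary values picked for illustration), the kernel shape can be inspected directly on the layer:

```python
import torch
import torch.nn as nn

nb_kernels = 8  # illustrative value, not from the original post

# The 125 slices are treated as input channels.
conv2d = nn.Conv2d(in_channels=125, out_channels=nb_kernels, kernel_size=3)
print(conv2d.weight.shape)  # torch.Size([8, 125, 3, 3])

# Spatial size reduced from 256x256 to 64x64 to keep the example light.
x = torch.randn(1, 125, 64, 64)   # [batch, slices-as-channels, height, width]
out = conv2d(x)
print(out.shape)  # torch.Size([1, 8, 62, 62])
```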

On the other hand, if you are using nn.Conv3d, you could permute the slices to the “depth” dimension and add a single dimension for the channels. This would mean your kernel is now a volume with a shape of e.g. [3, 3, 3], such that 3 neighboring slices will be used to produce the output for the current position.
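
A sketch of this second approach, assuming the volume is stored as [batch, height, width, slices] (the layout, the 4 output channels, and the reduced 32x32x20 size are assumptions made for illustration):

```python
import torch
import torch.nn as nn

# Assumed layout: [batch, height, width, slices].
# 32x32x20 is used instead of 256x256x125 to keep the example light.
volume = torch.randn(1, 32, 32, 20)

# Move the slices to the depth axis, then add a single channel dimension.
x = volume.permute(0, 3, 1, 2).unsqueeze(1)   # [batch, 1, depth, height, width]
print(x.shape)  # torch.Size([1, 1, 20, 32, 32])

# padding=1 keeps depth/height/width unchanged for a 3x3x3 kernel.
conv3d = nn.Conv3d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
print(conv3d.weight.shape)  # torch.Size([4, 1, 3, 3, 3])

out = conv3d(x)
print(out.shape)  # torch.Size([1, 4, 20, 32, 32])
```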

What would you like your model to learn in the MRI images? I think in a segmentation case the second approach could work better, but that’s just my guess.

Thank you very much for the explanation, super clear! Right now I’m working on classification, so maybe the first approach is more suitable, as well as lighter in terms of the number of parameters.

Hello guys,
I want to use conv3d over a spatial matrix of words. Each cell represents an embedding vector of size 100. Hence, an input has the shape [batch_size, height, width, embedding_dim].
1 - Should I use in_channels = embedding_dim?
2 - I want to get an output of size [batch_size, height, width, output_size], where output_size is any desired int. After the convolution I would like a feature vector of length output_size for each cell, i.e. a spatial feature for each word.
How would I use conv3d in this case? Please suggest the correct usage and let me know whether I am thinking about this correctly.


It depends.

In the Conv2D case, the expected input is [batch_size, in_channels, height, width]. When we perform a 2D convolution, each filter acts on all the channels. We more or less assume the channels have no depth ordering. Consider RGB, for example: the channels don’t have to be in that order; we just keep them that way by convention.

In the Conv3D case, the expected input is [batch_size, in_channels, depth, height, width]. An example of this would be video. In this case the order does matter: we can’t shuffle the frames without losing meaning. We use a 3D convolution to capture that information, where each filter slides across the frames (rather than acting on all of them at once).

Now for your specific case, it doesn’t sound like you have a “depth” aspect to your data. The embedding vector seems important as a whole, so it doesn’t make sense to me to stride over your embedding dimension. If I’m understanding correctly, I think you want a Conv2d with kernel_size=1 and stride=1. Your in_channels would be embedding_dim, and your out_channels would be whatever feature size you want. That’ll result in an output of shape [batch_size, out_channels, height, width]. Keep the order of the dimensions in mind: your input is [batch_size, height, width, embedding_dim], so it has to be permuted to [batch_size, embedding_dim, height, width] before the convolution.
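
Here is a small sketch of that suggestion, assuming a 5x7 grid of words and output_size = 16 (both values are made up for illustration). The permutes move the embedding dimension to the channel position and back:

```python
import torch
import torch.nn as nn

batch_size, height, width, embedding_dim = 2, 5, 7, 100
output_size = 16  # illustrative value

# Input as described in the question: [batch, height, width, embedding_dim]
words = torch.randn(batch_size, height, width, embedding_dim)

# A 1x1 convolution projects each cell's embedding independently.
conv = nn.Conv2d(embedding_dim, output_size, kernel_size=1, stride=1)

x = words.permute(0, 3, 1, 2)             # [batch, embedding_dim, height, width]
features = conv(x)                         # [batch, output_size, height, width]
features = features.permute(0, 2, 3, 1)    # [batch, height, width, output_size]
print(features.shape)  # torch.Size([2, 5, 7, 16])
```

Note that with kernel_size=1 each cell is transformed on its own, without looking at its neighbors. If neighboring cells should contribute to the feature, something like kernel_size=3 with padding=1 would keep the same output shape while mixing in the surrounding words.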

I have an input CT of size 100x512x512, and I want to enhance its quality using a corresponding higher-quality image. Can I use a Conv3d to stride over 3D volumes of neighboring slices to produce a 2D image at the end?

The 3D convolution would return an output volume, but you could try to reduce one of the dimensions (e.g. the depth).
I.e. an input of [batch_size, channels, depth, height, width] would result in an output of [batch_size, out_channels, depth*, height*, width*], where the * shapes are calculated depending on the kernel size, stride, padding, dilation, etc.
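
One possible way to reduce the depth, sketched under assumptions: conv_out_size is a hypothetical helper that follows the output-size formula from the nn.Conv3d documentation, and the kernel simply spans the full depth with no depth padding, so the depth collapses to 1 and can be squeezed away. Sizes are reduced (10 slices of 32x32) to keep it light, and this is only one of several options (e.g. a strided convolution or pooling over depth would also work).

```python
import torch
import torch.nn as nn

def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Hypothetical helper: output length along one dimension of a convolution."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

depth, height, width = 10, 32, 32   # reduced from 100x512x512
x = torch.randn(1, 1, depth, height, width)

# The kernel covers all slices (no padding along depth), so depth* becomes 1.
# Height and width are preserved by a 3x3 window with padding 1.
conv = nn.Conv3d(1, 1, kernel_size=(depth, 3, 3), padding=(0, 1, 1))
out = conv(x)
print(out.shape)  # torch.Size([1, 1, 1, 32, 32])

# Drop the singleton depth dimension to get a 2D image per sample.
image = out.squeeze(2)
print(image.shape)  # torch.Size([1, 1, 32, 32])

print(conv_out_size(depth, depth))            # 1
print(conv_out_size(height, 3, padding=1))    # 32
```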

I have a question about using video with Conv3D.
My input is like [batch_size=128, channels=3, depth=32, height=64, width=64]

Does it matter whether I feed the input as:
first case: [batch_size=128, channels=3, depth=32, height=64, width=64]
second case: [batch_size=128, channels=3, height=64, width=64, depth=32]

If I feed it as in the second case, what will happen? Will the network fail to learn, or will it just learn less efficiently?

Thank you

I don’t think you should see a significant difference as long as the data augmentation and transformations are applied to the corresponding dimensions, since the activation volume would just be permuted, i.e. the same would happen if you transpose the height and width of an image.
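
To illustrate the permutation argument with a small check (random tensors and small sizes chosen only for this sketch): if the kernel is permuted the same way as the input, the output is just the permuted version of the original output. A trained network learns its own kernels, so with a cubic kernel it can in principle represent the same function in either layout.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 8, 16, 16)   # [batch, channels, depth, height, width]
w = torch.randn(4, 3, 3, 3, 3)     # [out_channels, in_channels, kD, kH, kW]

# First case: depth in front of height and width.
out_a = F.conv3d(x, w)             # [2, 4, 6, 14, 14]

# Second case: depth moved to the last axis, in both the input and the kernel.
x_b = x.permute(0, 1, 3, 4, 2)     # [batch, channels, height, width, depth]
w_b = w.permute(0, 1, 3, 4, 2)
out_b = F.conv3d(x_b, w_b)         # [2, 4, 14, 14, 6]

# The two results should agree up to floating-point error once out_a is
# permuted into the same layout as out_b.
same = torch.allclose(out_a.permute(0, 1, 3, 4, 2), out_b, atol=1e-4)
print(same)
```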

Thank you for the reply.
Let me ask a question about conv3d that is related to this post.

I have some code in Keras, and I have a problem understanding the “same” padding.
My input shape to this layer is: [128, 2048, 4, 2, 2]
This is my Keras code:
combined = Conv3D(128, (3, 3, 3), strides=1, padding='same')(combined)

And I want to do the same in PyTorch. What I did:
self.conv1 = nn.Conv3d(2048, 128, kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))

Do these perform the same operation?

Yes, for a kernel size of (3, 3, 3) (and a stride of 1) you would have to use a padding value of (1, 1, 1) to get the same output shape.
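
A quick way to check this, as a sketch with the channel counts reduced from 2048/128 to 16/8 to keep it light. It assumes PyTorch 1.9 or newer, where nn.Conv3d also accepts padding='same' for stride 1, and it assumes the channels-first layout PyTorch expects ([batch, channels, depth, height, width]). Keras defaults to channels_last unless data_format is set, so the layout on the Keras side is worth double-checking.

```python
import torch
import torch.nn as nn

# Channels reduced from 2048 -> 16 and 128 -> 8 for a lightweight example.
x = torch.randn(2, 16, 4, 2, 2)   # [batch, channels, depth, height, width]

conv_explicit = nn.Conv3d(16, 8, kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))
# Assumes PyTorch >= 1.9, where the string 'same' is accepted for stride 1.
conv_same = nn.Conv3d(16, 8, kernel_size=(3, 3, 3), stride=1, padding='same')

# Use identical weights so the two layers can be compared directly.
conv_same.load_state_dict(conv_explicit.state_dict())

out_explicit = conv_explicit(x)
out_same = conv_same(x)
print(out_explicit.shape)  # torch.Size([2, 8, 4, 2, 2])
print(out_same.shape)      # torch.Size([2, 8, 4, 2, 2])
print(torch.allclose(out_explicit, out_same, atol=1e-5))
```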