Conv3D model input tensor

Hello, I am new to PyTorch and I want to build a classifier for 3D DICOM MRIs. I want to use the pretrained resnet18 from the MONAI library, but I am confused about the input dimensions of the tensor. The shape of the images in my dataloader is [2,160,256,256], where 2 is the batch_size, 160 is the number of DICOM images for each patient and 256x256 is the dimension of the images. When I try to run the model I get this error: Expected 5-dimensional input for 5-dimensional weight [64, 3, 7, 7, 7], but got 4-dimensional input of size [2, 160, 256, 256] instead

If I unsqueeze the tensor before feeding it to the model I get: Given groups=1, weight of size [64, 3, 7, 7, 7], expected input[1, 2, 160, 256, 256] to have 3 channels, but got 2 channels instead. Can anybody help me figure this out?

Hi gkrisp!

This is telling you that the first Conv3d layer of your resnet has a
weight with shape [64, 3, 7, 7, 7], which is to say that it is a
Conv3d (in_channels = 3, out_channels = 64, kernel_size = 7).

Therefore the input to your resnet (and hence to this Conv3d) must
have shape [nBatch, nChannels = 3, depth, height, width].
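
As a minimal sketch of the shape rule above: the layer below is only a stand-in for the first layer of the resnet (an assumption, built from the weight shape in the error message, not taken from the MONAI source), and the dummy volume is kept small so it runs quickly. PyTorch documents the Conv3d input layout as (N, C, D, H, W).

```python
import torch
import torch.nn as nn

# Stand-in for the resnet's first layer (assumption: only in_channels,
# out_channels and kernel_size are taken from the error message).
conv = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=7)
weight_shape = tuple(conv.weight.shape)  # [out_channels, in_channels, kD, kH, kW]

# Small dummy volume laid out as [nBatch, nChannels, depth, height, width].
x = torch.randn(2, 3, 8, 16, 16)
with torch.no_grad():
    out = conv(x)
out_shape = tuple(out.shape)

print(weight_shape)  # (64, 3, 7, 7, 7)
print(out_shape)     # (2, 64, 2, 10, 10)
```

The channel dimension of the input (3 here) has to match the second entry of the weight shape; the batch dimension can be any size.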

Your first error was caused because your input didn’t have the nBatch
dimension. This dimension can have size one if you want, but it has to
be present.

unsqueeze() provided the required nBatch dimension, but you then
get the second error because your input only has 2 channels, while
this particular Conv3d is expecting 3 channels.

Because your resnet is pretrained (and I assume that you don’t want
to throw that away), you should probably add a third channel to your
input, using something redundant like a duplicate of your second channel
or the average of your two “real” channels.


K. Frank

Hi Frank, thanks a lot for your answers. I have some more questions if you don’t mind.
Do you think it’s better to use a resnet with 2d convolutions and override the first layer to have input of 160 channels (160 is the number of 256x256 images for each mri) with this command:

model.features[0] = nn.Conv2d(160,64,7,7,7)

But in that case, the model takes 160 channels and then squish them to 64. Usually the models first expand the channels before reducing them. Would that be a problem ?

The input shape is [2,160,256,256] where theoretically 2 is the nBatch. I don’t know if I’m not getting that right.

Hi gkrisp!

It would almost certainly be worse, if I understand correctly that you
are working with 3d images.

The point is that your images have substantive spatial structure.
That is, just as a pixel with x = 17 is right next to x = 18, but far
away from x = 148, a slice in your 3d “z-stack” with z = 17 is
right next to the z = 18 z-slice, but far away from z = 148.

Convolutions know about and respect this structure, while a general
fully-connected layer does not (and the in_channels-to-out_channels
part of a convolutional layer is fully connected).

You could use a Conv2d with in_channels = 160 (and with all of
those extra connections, it would be a superset of the Conv3d so,
in principle, could perform the same inference if properly trained),
but such a network would have to “learn” the spatial structure of your
z dimension, which could take a long time or not happen at all with
realistic amounts of training data and time.
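
To get a rough feel for "all of those extra connections", here is a small sketch that only counts weights in the two candidate first layers. The layer hyperparameters (64 output channels, kernel size 7, a single input channel for the Conv3d) are assumptions chosen to mirror the numbers in this thread, and biases are left out to keep the count simple.

```python
import torch.nn as nn

# Conv2d that treats the 160 z-slices as channels: every output channel is
# connected to every slice, with no notion of which slices are neighbours.
conv2d = nn.Conv2d(in_channels=160, out_channels=64, kernel_size=7, bias=False)

# Conv3d over a single-channel volume: the kernel only spans 7 adjacent
# slices and is shared along the z dimension.
conv3d = nn.Conv3d(in_channels=1, out_channels=64, kernel_size=7, bias=False)

n2d = conv2d.weight.numel()  # 64 * 160 * 7 * 7
n3d = conv3d.weight.numel()  # 64 * 1 * 7 * 7 * 7

print(n2d, n3d)  # 501760 21952
```

The Conv2d layer has roughly 23 times as many weights, and none of them encode that neighbouring slices are related.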

As a general rule, if your problem has some particular structure and
there is a natural way of “telling” your network about that structure
by building that structure into the architecture, you’re much better off
doing so, rather than forcing your network to “learn” that structure.

As an aside, if your model were pretrained, so, in particular,
model.features[0] were pretrained, doing this would throw away
that part of the pretraining, which could be costly.

Regarding the batch dimension: yes, my mistake. You said in your first post that 2 is your batch size.

The basic point still stands, however. Pytorch’s convolutional layers
require both a batch and a channels dimension, even if they are
“trivial,” singleton (that is, size = 1) dimensions.

So, if your input image has shape [2, 160, 256, 256], with nBatch = 2,
and no explicit channels dimension, you have to add the required
channel dimension, e.g.:

image = image.unsqueeze (1)

However, now you have nChannels = 1, and your pretrained resnet
requires nChannels = 3, so probably the best solution (if you want to
keep the benefits of your pretrained model) is to expand() your input
to have 3 (redundant) channels:

image = image.unsqueeze (1).expand (-1, 3, -1, -1, -1)

Now image will have shape [2, 3, 160, 256, 256].

(Note that expand() just provides a logical view into your 1-channel
image – the 2 additional channels don’t actually “exist” in the sense
of being stored in memory or being able to have values that differ from
those of the one “real” channel.)
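
The view behaviour described above can be checked with a short sketch. The tensor sizes are deliberately tiny and the Conv3d layer is a hypothetical stand-in for a 3-channel first layer, not the actual pretrained resnet.

```python
import torch
import torch.nn as nn

# Tiny stand-in for a batch of volumes: [nBatch, depth, height, width].
image = torch.randn(2, 4, 8, 8)

# Add a singleton channel dimension, then expand it to 3 channels.
# -1 means "keep this dimension's size unchanged".
x = image.unsqueeze(1).expand(-1, 3, -1, -1, -1)

print(tuple(x.shape))                      # (2, 3, 4, 8, 8)
print(x.stride()[1])                       # 0: stepping along channels moves nowhere in memory
print(x.data_ptr() == image.data_ptr())    # True: no new storage was allocated
print(torch.equal(x[:, 0], x[:, 2]))       # True: all channels hold the same values

# The expanded view can be passed straight to a 3-channel Conv3d.
conv = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3)
with torch.no_grad():
    out = conv(x)
print(tuple(out.shape))                    # (2, 8, 2, 6, 6)
```

If a later step needs to write to the channels independently, calling .contiguous() or .clone() on the expanded tensor makes a real copy.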


K. Frank

image = image.unsqueeze (1).expand (-1, 3, -1, -1, -1)

I tried this for a small amount of my dataset and it worked. Now I am going to do it on the whole dataset (435 3D MRIs).
So, basically, all this does is ‘transform’ the grayscale images to RGB (3 channels).
Also, if I could find a model that was originally trained on grayscale 3D images, I wouldn’t face this problem, right?
Thank you again for sharing your knowledge, you’ve been really helpful!!

Hi gkrisp!

I haven’t experimented with this, so I don’t have any firm knowledge
here. But based on my intuition, I would say:

In an ideal world, if you have a one-channel, grayscale use case, then
all else being equal, a pretrained one-channel model would seem to
be the best. But I would expect any one-channel vs. three-channel
benefit to be pretty marginal. I would think that other details of the
architecture or the quality of pretraining would matter much more.
I just don’t see using a three-channel model on grayscale images as
being a problem in any practical sense.


K. Frank