I have a project in torch to build a 3D semantic segmentation model for 3D MRI data stored in NRRD files (which can be converted to 3D tensors). The images have different but fairly similar shapes.
I tried to create something similar to a 3D Unet, but in an invariant version, using torch.nn.Conv3d in the encoder blocks and torch.nn.ConvTranspose3d in the decoder blocks. However, the model won't return the same shape unless all 3 dimensions of the image shape are divided without remainders in 16 (there are 4 encoder blocks with a stride of 2).
Is there a way to fix it?
One fallback I thought of, in case an invariant 3D Unet is not possible, is padding the image with 0 to reach a shape that is divisible by 16. I am sure that it’s not optimal, so if there are other ideas I would love to hear them.
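To make the mismatch concrete, here is a small sketch of the shape arithmetic along one axis. It assumes each encoder block halves the length with floor division (as a Conv3d with kernel_size=2, stride=2 and no padding would) and each ConvTranspose3d exactly doubles it; the real kernel sizes are not given in the post, and the function names are made up for illustration.

```python
def round_trip_size(n, depth=4):
    """Length after `depth` stride-2 downsamplings and `depth` 2x upsamplings."""
    for _ in range(depth):
        n = n // 2              # assumed: kernel_size=2, stride=2, no padding
    return n * 2 ** depth       # assumed: each transposed conv doubles the length


def padded_size(n, multiple=16):
    """Smallest length >= n that is a multiple of `multiple`."""
    return n + (-n) % multiple


print(round_trip_size(160))  # 160: divisible by 16, shape is preserved
print(round_trip_size(150))  # 144: the remainders are lost on the way down
print(padded_size(150))      # 160
```

Under these assumptions every floor division throws away a remainder that the decoder cannot recover, which is why only multiples of 16 survive the round trip.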
The three spatial dimensions (height, width, depth) of your 3D image
don’t mix as the image passes through the U-Net in the following
sense: Whether the length of, say, the height dimension is “valid” (which
you say in your case is “divided without remainders in 16”) doesn’t
depend on the values of the other two dimensions. Each of the three
dimensions can be valid or invalid independently.
So I don’t think the fact that you are working with a 3D U-Net is relevant
to the question I think you’re asking – your question applies equally to 2D
and 3D U-Nets.
I’m not sure what you mean by an “invariant U-Net.” Are you asking that
the output image have the same shape as the input image (not counting
the channels dimension)?
If so, I don’t think you want this. As your image passes through a U-Net,
its edges get nibbled away by the convolutions. Padding the convolutions
so that the image doesn’t shrink pollutes somewhat the U-Net logic. Think
about how padding the convolutions would be injecting “noise” at the
image edges at each convolution stage.
I highly recommend carefully reading the original U-Net paper, paying
attention to how the shape of the image shrinks as it flows through the
U-Net. Note that the original paper does not pad its convolutions.
The condition for “valid” input-image dimensions is more nuanced for
the original U-Net (and I believe this will be true for any U-Net with
optimally-clean convolution / no-padding logic). Look at the U-Net
diagram in the original paper. It shows how the image shrinks and
from it you can deduce the rules for valid input dimensions.
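The shrinking shown in the paper's diagram can be turned into a small size calculator for one axis. This is only a sketch, and it assumes the layout of the original paper: two unpadded 3x3 convolutions per level, 2x2 max pooling, four down / up levels, and 2x up-convolutions. The function names are made up for illustration.

```python
def unet_output_size(n, depth=4):
    """Output length along one axis of an unpadded U-Net, or None if n is invalid."""
    for _ in range(depth):
        n -= 4                  # two unpadded 3x3 convolutions remove 2 pixels each
        if n <= 0 or n % 2:     # 2x2 pooling needs a positive, even length
            return None
        n //= 2
    n -= 4                      # the two convolutions at the bottom of the "U"
    for _ in range(depth):
        if n <= 0:
            return None
        n = 2 * n - 4           # 2x up-convolution, then two unpadded convolutions
    return n if n > 0 else None


def next_valid_size(n, depth=4):
    """Smallest valid input length that is >= n."""
    while unet_output_size(n, depth) is None:
        n += 1
    return n


print(unet_output_size(572))  # 388, the sizes in the paper's diagram
print(unet_output_size(576))  # None: divisible by 16, yet not valid
print(next_valid_size(573))   # 588
```

With these assumptions the valid lengths come out as those that leave remainder 12 when divided by 16 (starting at 188), so the rule is indeed different from plain divisibility by 16.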
If by “fix it,” you mean have your U-Net output an image whose size
matches that of its input, you can achieve this by appropriately padding
each of the convolutions along the way. But, as discussed above, you
don’t really want to do this.
My recommendation is to reflection-pad the input image up to a size
that has valid dimensions (which I think will be more complicated than
“divisible by 16”). Your image edges will have a somewhat different
character than the interior, but this is already the reality before you pad,
and I think that reflection padding is the “gentlest” way to let your model
learn to segment pixels near the edges.
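A minimal sketch of this kind of reflection padding in PyTorch, assuming a 5D tensor laid out as (batch, channels, depth, height, width). Purely for illustration it treats "valid" as "a multiple of 16"; for an unpadded U-Net the target sizes would come from that network's own rule instead. The helper name reflect_pad_to_multiple is hypothetical.

```python
import torch
import torch.nn.functional as F


def reflect_pad_to_multiple(volume, multiple=16):
    """Reflection-pad the last three dims of a (N, C, D, H, W) tensor."""
    pads = []
    # F.pad lists pads for the last dimension first: (W_l, W_r, H_l, H_r, D_l, D_r)
    for size in reversed(volume.shape[-3:]):
        extra = (-size) % multiple
        pads += [extra // 2, extra - extra // 2]   # split as evenly as possible
    padded = F.pad(volume, tuple(pads), mode="reflect")
    return padded, tuple(pads)


x = torch.randn(1, 1, 37, 50, 64)
padded, pads = reflect_pad_to_multiple(x)
print(padded.shape)  # torch.Size([1, 1, 48, 64, 64])
print(pads)          # (0, 0, 7, 7, 5, 6)
```

Returning the pad amounts makes it possible to crop the prediction back to the original shape afterwards. Note that reflection padding requires each pad to be smaller than the dimension it is applied to.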
You are definitely right about this one. I didn’t want to get suggestions like resizing the image to a shape that is divisible by 16, because it’s only possible with 2D images and not 3D, so I focused on 3D. But creating an invariant 2D Unet is the same problem as creating an invariant 3D Unet for me.
Yes, this is exactly what I meant.
I never noticed that padding wasn’t used in the original U-Net.
I’ll read it carefully as you suggested, since I believe it is important, although many papers that use a similar architecture do pad their convolutions.
I didn’t think about this option; I will definitely try it since it sounds very reasonable.
Again, thank you for your highly detailed answer, Frank. It really helped me!
Just to be clear, any resizing would happen on a dimension-by-dimension
basis, so there is no substantive difference between 2D and 3D.
I think that many people are not as insightful and / or careful as the authors
of the original U-Net paper. They really did a quite nuanced job of it and I
think that not padding internally is the right way to go.
It can be a real pain in the neck to prepare batches of images of varying
size to have valid dimensions for input into a U-Net, so there is a certain
motivation to cut corners, rather than do it right.
I think what @KFrank is getting at is you do not want to add padding into any of the convolution operations in the UNet.
Regarding an invariant UNet, it’s not really possible/practical, especially given the problem you intend to use it for. As far as I can tell, you have 3D MRI scans, and corresponding targets of identical size. And you want to train the model to precisely identify what pixels contain a particular target.
If you used nn.AdaptiveAvgPool3d within the encoder of the model, you’re going to have a major problem with getting the decoder to give the correct output size that matches your targets, because the size information will be lost.
So any resizing or image padding should be handled in your image preprocessing and applied identically to your targets so they match pixel for pixel.
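As a rough sketch of what "applied identically" could look like in PyTorch: the same pad amounts are used for the image and its label volume, and a matching crop undoes them. The helper names pad_pair and crop_back are hypothetical, tensors are assumed to be (N, C, D, H, W), and the labels are cast to float before padding on the assumption that reflection padding may not be available for integer dtypes.

```python
import torch
import torch.nn.functional as F


def pad_pair(image, target, pads):
    """Apply the same reflection padding to an image and its label volume."""
    image = F.pad(image, pads, mode="reflect")
    # Assumption: pad labels as float, then restore the original integer dtype.
    target = F.pad(target.float(), pads, mode="reflect").to(target.dtype)
    return image, target


def crop_back(volume, pads):
    """Undo padding given as (W_l, W_r, H_l, H_r, D_l, D_r)."""
    w_l, w_r, h_l, h_r, d_l, d_r = pads
    d, h, w = volume.shape[-3:]
    return volume[..., d_l:d - d_r, h_l:h - h_r, w_l:w - w_r]


image = torch.randn(1, 1, 37, 50, 64)
target = torch.randint(0, 3, (1, 1, 37, 50, 64))
pads = (0, 0, 7, 7, 5, 6)  # example amounts that bring each dim to a multiple of 16

image_p, target_p = pad_pair(image, target, pads)
print(image_p.shape == target_p.shape)  # True
```

The same crop_back call can be applied to the model output at inference time to recover a prediction with the original scan's shape.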
Hi, I read the original U-Net paper, and now what you, @J_Johnson and @KFrank, are saying is much clearer and makes sense. I think I will remove padding from the network itself, as you suggested, and will use mirror-padding on the images in the preprocessing.
Thanks a lot, I really appreciate the help and explanations.
For anyone who encounters this post later on: apparently the MONAI library has a transform, monai.transforms.DivisiblePad, which does the padding for this case.