UNET Semantic Segmentation: where resize mask shape if input shape != mask shape?

Hello all,

I am trying to code from scratch the UNET where input dimensions in the paper are:
input.shape = (batch_size, 3, 572, 572) and out.shape = (batch_size, 1, 388, 388).
link here: UNET_PAPER
In this case which is the best practice to resize out.shape?
Would it be better to resize the target in the Dataset class, so that targets are loaded already with shape == out.shape, or can I just resize the target in the forward method of the UNET? Is resizing a cheap operation, or might this choice heavily slow down the computation?
Thanks all!

Hi Alessandro!

You shouldn’t resize out.shape.

Leaving aside the “tiling” of large images (as discussed in the paper
you cite), the shapes for the use case you describe would typically be
[batch_size, 1, 388, 388] for both out.shape and target.shape
(where target is your ground-truth segmentation mask). Your initial,
unmodified input.shape would be [batch_size, 3, 388, 388]
(having the same spatial extent).

However, you would use “mirroring” to increase the spatial extent of your
input images to the necessary [572, 572]. Quoting from page 3 of the
arxiv version of the original U-Net paper:

To predict the pixels in the border region of the image, the missing
context is extrapolated by mirroring the input image.

What’s going on here is that, rather than “pollute” the borders by adding
padding to the convolutions, the original U-Net just lets the convolutions
snip off a few border pixels at every convolution stage, so the “image”
shrinks as it passes through the network. To make up for this, they pad
the borders of the input image with the right, rather large, number of pixels
needed to make up for the shrinkage, and – in my mind, sensibly – they
do this with mirroring.
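To make this concrete, here is a minimal sketch in plain pytorch (the layer and sizes are just illustrative) of both effects: an unpadded 3x3 convolution trimming one pixel from each border, and reflection padding growing a 388x388 image to 572x572 (a pad of (572 - 388) / 2 = 92 pixels on each side):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# An unpadded 3x3 convolution trims one pixel from each border.
conv = nn.Conv2d(3, 8, kernel_size=3, padding=0)
x = torch.randn(1, 3, 572, 572)
print(conv(x).shape)  # torch.Size([1, 8, 570, 570])

# Mirror ("reflect") a 388x388 image out to 572x572 by padding
# (572 - 388) / 2 = 92 pixels on each side.
img = torch.randn(1, 3, 388, 388)
padded = F.pad(img, (92, 92, 92, 92), mode="reflect")
print(padded.shape)  # torch.Size([1, 3, 572, 572])
```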

This is the approach you should take.

(When using tiling, they snip out larger, 572x572 tiles out of the input
image and smaller, 388x388 tiles out of the masks, and only use mirroring
to enlarge tiles that lie on the edge of the input image.)


K. Frank

Hello Frank,
Thanks for helping!
I am still confused, tbh:
e.g., in this implementation that I found on GitHub here, at line 126 he is cropping the mask to 388 too.
Furthermore, I don’t understand the purpose of applying that huge padding that you mention to increase the size of the input from 388 to 572. In this case wouldn’t we be adding too much noise into the model, considering that 388² = 150,544 and 572² = 327,184, so (327,184 - 150,544) / 327,184 ≈ 54% of the pixels of each input image would be zeros?
Maybe I did not understand your suggestions on padding/mirroring, but I am very confused.

UPDATE: I understood that you meant using torchvision.transforms.Pad(padding_mode='reflect') to mirror the image in my case. However, why is he still cropping the mask? Is it wrong?

Hi Alessandro!

There are any number of U-Net implementations scattered hither and
yon across the internet. I can’t tell you how good they are or how closely
they hew to the original U-Net paper.

First, read the original U-Net paper carefully, understand how the image
gets its edges trimmed off (by the unpadded convolutions) as it passes
through the network, and understand how tiling works in the use case
of the original paper.

With tiling, you are only actually padding (adding “noise” to) those tiles
that are along the edge of the (large) image.

Now consider a use case where your labelled training data consists of
images and masks of size 572x572, and where the images you will wish
to perform inference on (roughly speaking, for example, your test set) are
also of size 572x572.

You can either train by cropping the segmentation mask down to 388x388
and, at inference time, only generate a segmentation prediction for the
central 388x388 subimage of your 572x572 input image. Segmenting only
part of your input image may be good enough for your use case.
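A minimal sketch of such a center crop (assuming masks come as (N, C, H, W) tensors; the helper name is my own):

```python
import torch

def center_crop(mask: torch.Tensor, size: int = 388) -> torch.Tensor:
    """Return the central size-x-size region of a (N, C, H, W) tensor."""
    _, _, h, w = mask.shape
    top = (h - size) // 2
    left = (w - size) // 2
    return mask[:, :, top:top + size, left:left + size]

mask = torch.zeros(1, 1, 572, 572)
print(center_crop(mask).shape)  # torch.Size([1, 1, 388, 388])
```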

Or you can mirror (“reflection-pad”) your input images up to whatever size*
gets trimmed down to (or just larger than) 572x572, train using all of your
labelled training data – that is, using all of the information in your full
572x572 segmentation masks – and then be able to perform inference
on your full 572x572 input images, rather than just their 388x388 central
regions.

Your segmentation performance near the edges of your full 572x572 input
images may not be as good as near the middle (you can evaluate any
such degradation in performance using your test set), but it might well be
better than not performing any segmentation near the edges at all.

Your choice.

*) Determining which image sizes flow through U-Net’s convolutions
and 2x2 downsamplings cleanly (so that only even-sized “images” get
downsampled) and what the final output prediction sizes are is something
of a fussy computation, but straightforward.
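For what it's worth, here is a sketch of that fussy-but-straightforward computation for the original architecture (four down/up levels, two unpadded 3x3 convolutions per stage; the helper is just illustrative):

```python
def unet_sizes(in_size: int, depth: int = 4) -> int:
    """Trace the spatial size through a U-Net built from unpadded 3x3 convs.

    Each level going down: two 3x3 convs (-2 pixels each), then a 2x2
    max-pool (halve); going up: a 2x2 up-conv (double), then two 3x3
    convs. Raises if an odd size would hit a pooling step.
    """
    s = in_size
    for _ in range(depth):      # contracting path
        s -= 4                  # two unpadded 3x3 convolutions
        if s % 2:
            raise ValueError(f"odd size {s} cannot be 2x2-pooled cleanly")
        s //= 2                 # 2x2 max-pool
    s -= 4                      # bottleneck convolutions
    for _ in range(depth):      # expanding path
        s *= 2                  # 2x2 up-convolution
        s -= 4                  # two unpadded 3x3 convolutions
    return s

print(unet_sizes(572))  # 388
```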


K. Frank

Hello Frank,
After spending some more hours I finally got the point!
Thanks a lot for your help :cowboy_hat_face: