How would you train a segmentation model with varying image sizes?

Hi, I’m trying to train the second stage of a two-step pipeline for image instance segmentation of two classes. This is a multi-label problem, since a pixel can belong to both classes at the same time (for example, class A can cover a larger area and class B can exist inside it, so part of the image is both A and B). However, the multi-label nature isn’t the problem. The problem is that the training images vary greatly in size (from as small as 10 by 10 pixels up to about 300 by 300 pixels) and are also rectangular.

This is because this model is a downstream model that takes the bounding-box crops from a YOLO object-detection model’s inference as its input. The resolution and scale of the images are the same, since they are just bounding boxes of different sizes, but to train a vanilla U-Net I would have to resize and pad them to a uniform size while keeping the aspect ratio the same. That would not be ideal: the backgrounds of my original images are not white, and the background of each image is different, so padding might not be the best option.

I just wanted to ask the community for opinions on the best approach. Pretty much every model out there requires a fixed input size (usually square and a multiple of 16). Maybe a scale-invariant model like DeepLabv3+ would be better, since it uses ASPP? But DeepLab also requires a uniform input image size.

Would appreciate any input from people who have had similar problems, thank you!

Hi Choke!

The short story: Use U-Net with reflection padding and be sure to
understand the padding requirements of a properly-architected U-Net.

As you mention, I would consider using U-Net. It’s a solid, widely-used
semantic-segmentation model. As a fully-convolutional model, it can be
trained on, and then used for inference for, images of varying sizes.

The architecture of the original U-Net was designed quite carefully, and,
as a result of that design, while it accepts images of arbitrarily-large size,
it can’t accept input images of arbitrary size. I call the acceptable input
sizes conformable with the specific U-Net (but I don’t think this term is
standard).

Let me assume that your input images and ground-truth targets are the
same size. You will then need to pad your targets to be conformable with
the output of the U-Net and pad your inputs to be conformable with the
input to the U-Net. The (padded) input size will be larger than the (padded)
target size, because the image shrinks when it flows through the U-Net as
unpadded convolutions nibble away at its edges.
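To make this concrete, here is a small sketch of what “conformable” means for the original unpadded U-Net, assuming the architecture from the U-Net paper (two unpadded 3x3 convolutions per level and a 2x2 max-pool between levels; the depth of 4 is that paper’s choice, not a requirement):

```python
def conformable(size, depth=4):
    """Is this spatial size conformable with an unpadded U-Net of this depth?"""
    s = size
    for _ in range(depth):
        s -= 4              # two unpadded 3x3 convolutions each remove 2 pixels
        if s <= 0 or s % 2 != 0:
            return False    # 2x2 max-pool needs an even size
        s //= 2
    return s - 4 > 0        # bottleneck convolutions must leave something

def output_size(size, depth=4):
    """Spatial size of the U-Net's output for a conformable input size."""
    s = size
    for _ in range(depth):
        s = (s - 4) // 2    # contracting path: convs, then pool
    s -= 4                  # bottleneck
    for _ in range(depth):
        s = s * 2 - 4       # expanding path: upsample, then convs
    return s

# The classic example from the U-Net paper: a 572-pixel input
# produces a 388-pixel output.
print(conformable(572), output_size(572))  # True 388
```

So to use this scheme you would pad each input up to the nearest conformable size and pad the target to the corresponding output size.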

I strongly recommend that you use reflection padding for the input. The
padding for the target won’t matter (because you will mask it), so I would
just use zeros.
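As an illustration (the sizes here are made up), `torch.nn.functional.pad` handles both cases:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: a 100 x 150 input padded up to 192 x 192,
# imagined to be the conformable input size of some U-Net.
img = torch.randn(1, 3, 100, 150)   # (N, C, H, W)
tgt = torch.zeros(1, 2, 100, 150)   # two-channel multi-label target

# pad order is (left, right, top, bottom)
img_padded = F.pad(img, (21, 21, 46, 46), mode='reflect')
tgt_padded = F.pad(tgt, (21, 21, 46, 46))   # zeros; masked out of the loss anyway

print(img_padded.shape)  # torch.Size([1, 3, 192, 192])
```

One caveat: pytorch’s reflection padding requires each pad amount to be smaller than the corresponding input dimension, so your tiniest (10 x 10) crops may need the reflection applied in stages or a fallback such as `mode='replicate'`.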

It sounds like you are performing a multi-label, two-class segmentation.
So your U-Net output should have two channels (per pixel) and you should
use BCEWithLogitsLoss as your loss criterion.

However, only your original unpadded target image contains meaningful
information. So you should mask your loss computation to only include
that original unpadded region. I would suggest using
BCEWithLogitsLoss (reduction = 'none') and then perform the
masked reduction yourself. (You could also use BCEWithLogitsLoss’s
weight constructor argument to perform this masking.)
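A sketch of the masked reduction (sizes again illustrative, continuing the example of a 100 x 150 target padded to 192 x 192):

```python
import torch

criterion = torch.nn.BCEWithLogitsLoss(reduction='none')

logits = torch.randn(1, 2, 192, 192, requires_grad=True)  # two channels: multi-label
target = torch.zeros(1, 2, 192, 192)

mask = torch.zeros(1, 1, 192, 192)
mask[:, :, 46:146, 21:171] = 1.0   # ones over the original unpadded region

per_pixel = criterion(logits, target)               # shape (1, 2, 192, 192)
# mask broadcasts across the two channels; divide by the number of
# unmasked elements to get a mean over the meaningful region
loss = (per_pixel * mask).sum() / (mask.sum() * 2)
loss.backward()
```

Because the padded region is multiplied by zero, it contributes nothing to the loss or to the gradient.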

One last technical point that’s a pytorch rather than U-Net issue. If you
use a batch size greater than one, then the individual samples within a
batch all have to be the same size (because they’re slices of a single
pytorch tensor). So you have to pad your input and target images to the
largest (padded) size of any of the samples in the batch.
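One way to do this is a custom `collate_fn` for your `DataLoader`. A sketch, assuming each sample is an `(input, target)` pair of tensors shaped `(C, H, W)` (that layout is my assumption, not anything U-Net prescribes):

```python
import torch
import torch.nn.functional as F

def pad_collate(batch):
    """Pad every sample in the batch to the batch's largest H and W."""
    max_h = max(img.shape[-2] for img, _ in batch)
    max_w = max(img.shape[-1] for img, _ in batch)
    imgs, tgts = [], []
    for img, tgt in batch:
        ph = max_h - img.shape[-2]
        pw = max_w - img.shape[-1]
        # reflection padding for the input (pad must be smaller than the
        # image dimension), plain zero padding for the target
        imgs.append(F.pad(img, (0, pw, 0, ph), mode='reflect'))
        tgts.append(F.pad(tgt, (0, pw, 0, ph)))
    return torch.stack(imgs), torch.stack(tgts)

batch = [(torch.randn(3, 20, 30), torch.zeros(2, 20, 30)),
         (torch.randn(3, 25, 18), torch.zeros(2, 25, 18))]
imgs, tgts = pad_collate(batch)
print(imgs.shape, tgts.shape)
```

You would pass this as `DataLoader(..., collate_fn=pad_collate)`, and build the loss mask from the original (pre-padding) sizes in the same way.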

Depending on your use case, you may well not need to resize any of
your images or change their aspect ratios. Within a batch, all of the
images will be padded to have the same size (and hence the same
aspect ratio), but the pixels inside of the padding on the borders will
be the pixels from the original images without any resizing or stretching.


K. Frank