Is it possible to train a segmentation model to identify part of a single image without having it overfit?

Hi all,

I have thousands of images that look like below:

The different shapes are the growth we are interested in detecting. Every image has unique shapes in it, and some images contain hundreds of these objects.

Another example is:

We want to extract only the shapes and discard the background.
Since they are all different shapes, it is not possible to train a single model, as we effectively have one image per shape.

One solution I tried was to train a segmentation model on the input image after partially labelling it, so that once the model learns the shape, the same image can be used to identify the rest of the shapes.

For example, in labelMe I created a label file for one of the shapes:

Then trained a model using this image.

The assumption was that after training, the model would be able to identify the rest of the shapes in the image. But what actually happens is that it only identifies the training data, i.e. the same mask that was used to train it.

This does make sense, as the model overfits to the input data and the test image is the same as the training image.

So my question is: given the problem I have, is it even possible to train a model like this, i.e. learn from part of the input image and then use that to identify the remaining objects in the same image?

Or would I have to look into conventional computer vision approaches for this?

Thank you

Is the training example exactly as presented, with only part of the image labeled? In that case it would be confusing from the model's perspective, as the loss would actually penalize the model for labeling the rest of the image. You may want to try some preprocessing steps (e.g., cropping/padding images) if labeling them exhaustively is too time consuming. Additionally, some data augmentations (flips, rotations, etc.) could help, as it doesn’t seem like the task is orientation-dependent.
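
If exhaustive labeling really is too expensive, one common workaround (just a sketch of the general idea, not something specific to your setup) is to mark the unlabeled pixels with an ignore value so the loss only sees the annotated region, e.g. via `ignore_index` in `CrossEntropyLoss`:

```python
import torch
import torch.nn as nn

# Sketch: mask out unlabelled pixels so the loss only "sees" the annotated
# region. The class ids are placeholders: 0 = background, 1 = growth,
# 255 = "not labelled yet".
IGNORE = 255
criterion = nn.CrossEntropyLoss(ignore_index=IGNORE)

logits = torch.randn(1, 2, 256, 256, requires_grad=True)   # (N, classes, H, W)
target = torch.full((1, 256, 256), IGNORE, dtype=torch.long)
target[:, :, 100:140] = 1   # the one labelled column (growth)
target[:, :, 60:100] = 0    # some labelled background next to it

loss = criterion(logits, target)   # pixels marked IGNORE contribute nothing
loss.backward()
```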

However, based on the well-defined contrast of the borders, it might be worthwhile to try traditional CV approaches to segment the shapes (either to check whether they work well enough on their own, or to use them as a source of training data that can be refined).
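
For the traditional CV route, even a simple threshold-plus-contours pass might be worth a quick try, either as a baseline or to bootstrap training masks. A very rough OpenCV sketch (the file name, threshold polarity, and area cutoff are all placeholders to adapt to your data):

```python
import cv2
import numpy as np

# Rough sketch: Otsu threshold + contour extraction as a first-pass segmentation.
# "image.png" is a placeholder; you may need THRESH_BINARY_INV instead,
# depending on whether the growth is darker or brighter than the background.
img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)
blur = cv2.GaussianBlur(img, (5, 5), 0)
_, binary = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Clean up small speckles before extracting contours.
kernel = np.ones((3, 3), np.uint8)
binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contours = [c for c in contours if cv2.contourArea(c) > 50]   # drop tiny blobs

mask = np.zeros_like(img)
cv2.drawContours(mask, contours, -1, 255, thickness=cv2.FILLED)
cv2.imwrite("mask.png", mask)
```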


Thanks for the reply.

That's what I thought as well. Using the same image for training and testing would not work in this case, as the model would overfit to the labelled data and, as you mentioned, the loss would penalize it for labelling the rest of the image.

I guess in my case it would then be best to look at traditional CV methods.

Thanks

So a little update on this…

Considering the helpful reply by @eqy, I had another go at this after changing the input data a little bit.

I first extracted only the region of interest and labelled that in LabelMe, so that it is different from the full input image and the model learns the shape of my object.

I am using the following input image for this:

Here is the cropped out version of only a single column and the associated screenshot from LabelMe.


I then gave the above as input to the model… and after training I used the original input image as the test image. But the results were not that good.

I then thought that maybe the problem is the lack of information in the image, since the majority of it is empty, so I kept a single column as-is and blurred the rest of the image, as can be seen below:

And I labelled the above image in LabelMe as well…

I then used this image to train a segmentation model, and below are the results when that model is applied to the non-blurred input image:

And the masks of the detections:

As can be seen, the results are not very impressive.

What can I do to improve the results? Will augmentation help? If so, which augmentations should I consider, given what I am trying to achieve?

Why is augmentation even needed, when the regions I want to detect look almost exactly like the labelled region?

Why are the results so poor for all regions that look exactly like the input?

Some thoughts on this would be helpful please.

Thank you

Augmentation should help, but I would also consider the effect of aspect ratios/scale carefully here. If you are feeding the thin column as input during training time with a resize step, and then doing evaluation with a more “square” image with resizing, then the model will see wildly different aspect ratios at train vs. test time.
So here I would design the augmentations and preprocessing such that they don’t cause a large discrepancy between the training-time and testing-time inputs (from the perspective of factors such as scale, resolution, etc.). An example of this could also be to break up both the training and test images into smaller patches to ensure that the aspect ratios are close regardless of the total size of the input.
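
As a rough illustration of the patch idea (the patch size and stride below are just placeholders), both the image and its mask can be cut into fixed-size tiles so the network sees the same scale and aspect ratio at train and test time:

```python
import torch

def to_patches(x, patch=256, stride=256):
    """Cut a (C, H, W) image or (H, W) mask into (N, C, patch, patch) tiles.
    Any remainder that doesn't fill a full tile is simply dropped."""
    if x.dim() == 2:              # a mask: add a channel dim so unfold works the same way
        x = x.unsqueeze(0)
    c = x.shape[0]
    tiles = x.unfold(1, patch, stride).unfold(2, patch, stride)   # (C, nH, nW, patch, patch)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)

image = torch.rand(3, 1024, 768)            # stand-in for a real image tensor
mask = torch.randint(0, 2, (1024, 768))     # stand-in for its label mask
img_tiles = to_patches(image)               # (12, 3, 256, 256)
mask_tiles = to_patches(mask.float())       # (12, 1, 256, 256)
```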

But ultimately there is no magic bullet for which augmentations to use; I recommend browsing through the literature and blog posts to see what is popular these days and what is applicable to your particular problem. The intro/related-work section of popular vision papers should be a good starting point, e.g., https://arxiv.org/pdf/1912.02781.pdf


Hi Faraz!

Let me not answer your specific question about training on only part
of an image, but instead make the following observation:

I would think that if you had the resources to label (that is, to provide
ground-truth segmentation masks for) entire images (rather than just a
single vertical slice as in the example you posted), it could be possible
to train a semantic-segmentation network (such as U-Net) to perform
well on your task.

In addition to seemingly well-defined boundaries between “foreground”
and “background,” the example images you posted have lots of structure
that I imagine such a network could learn. They have a relatively
regular (within a given image) stripe-like / grid-like structure.
So the idea would be that your network would learn not only local features
that indicate the boundaries, but would also be looking for predicted
boundaries that fit such a (learned) grid-like structure.

A multi-scale network such as U-Net might well be able to learn to
recognize the larger-scale grid-like structure that your images have
(in addition to learning the local features relevant for local boundary
detection).
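
(As an untested illustration only: an off-the-shelf U-Net implementation
such as segmentation_models_pytorch can be trained directly on full
image / mask pairs. The encoder choice, channel counts, and sizes below
are placeholders:)

```python
import torch
import segmentation_models_pytorch as smp   # one off-the-shelf U-Net implementation

# Untested sketch of a single binary-segmentation training step on full
# image / mask pairs. Encoder, channels, and sizes are placeholders.
model = smp.Unet(encoder_name="resnet34", encoder_weights=None,
                 in_channels=1, classes=1)
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.rand(4, 1, 256, 256)                     # stand-in batch of grayscale crops
masks = torch.randint(0, 2, (4, 1, 256, 256)).float()   # full ground-truth masks

model.train()
logits = model(images)            # (N, 1, H, W)
loss = criterion(logits, masks)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```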

You also say that you have thousands of images, so – if you’re able to
label them – I would think you could have enough training data.

As for augmentations, flips and rotations by 90 and 180 degrees would
seem to make sense (unless there is a substantive horizontal vs. vertical
difference that isn’t clear from the three example images you posted).

Given that the first two images you posted have different grid sizes
relative to the size of the image, rescaling might also work for augmentation.

From the examples you posted, I can’t tell whether the boundaries of the
images provide substantively useful information. If not, then translations
within the image (which is to say, cropping) could also be useful.

Also, mild blurring and addition of noise could be a legitimate augmentation.
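
(If it is useful, here is a rough, untested sketch of applying such
augmentations jointly to the image and its mask, so that the geometry
stays consistent between the two; the probabilities and parameters are
placeholders:)

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment(image, mask):
    """Apply the same random flip / 90-degree rotation to the (C, H, W) image
    and its mask, plus mild blur and noise on the image only."""
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    if random.random() < 0.5:
        image, mask = TF.vflip(image), TF.vflip(mask)
    k = random.choice([0, 1, 2, 3])          # rotate by 0 / 90 / 180 / 270 degrees
    if k:
        image = torch.rot90(image, k, dims=(1, 2))
        mask = torch.rot90(mask, k, dims=(1, 2))
    if random.random() < 0.3:
        image = TF.gaussian_blur(image, kernel_size=3)    # mild blurring
    if random.random() < 0.3:
        image = image + 0.01 * torch.randn_like(image)    # mild additive noise
    return image, mask

image = torch.rand(1, 256, 256)                     # stand-in grayscale crop
mask = torch.randint(0, 2, (1, 256, 256)).float()   # its binary mask
aug_image, aug_mask = augment(image, mask)
```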

But if you indeed have thousands of images that you could train on, I would
first try training without augmentation, and see how far you get with that.

Best.

K. Frank
