How to deal with similar labels that differ only in position?

I have 2 labels, out of multiple labels in a semantic segmentation task (computer vision), that are very similar. Their only difference is their location in the image: one of them is for the right part of the image and the other is for the left part. You can think of it as an image with two almost identical vertical lines, where the line on the left gets label 1 and the line on the right gets label 2.
How can I keep the advantages of using convolutional layers instead of linear layers while still getting the positional information that a linear layer gives you and a convolution can’t?

Currently all I get is overfitting on the training set, and in validation the model classifies poorly between those two classes (which makes sense, since they are almost impossible to tell apart without positional knowledge).

I thought about hard-coding positional info into the images, similar to how positional encodings are added to transformers in NLP, but here it’s more complicated than in NLP.

I also thought about using some kind of fully connected classifier for each pixel (where the pixel location, given by row and column, provides the positional indexing), but I don’t know how that could be used effectively during model training.

Any ideas?
Thank you in advance; any attempt to help will be highly appreciated.

Hi Kfir!

This seems like the right way to go. It might be as easy as adding a channel
to your input image that is just the x coordinate of each given pixel. (You
could also add a second y-coordinate channel to your image, if that would
have any relevance to your use case.)
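
For example, for a 2-d input of shape (N, C, H, W), such a channel could be
built with something like this (just a rough sketch; the function name and
shapes are illustrative):

import torch

def add_x_coordinate_channel(x):
    # x: input batch of shape (N, C, H, W)
    n, c, h, w = x.shape
    # the x coordinate of each pixel, repeated down every row
    xs = torch.arange(w, device=x.device, dtype=x.dtype)
    coord = xs.view(1, 1, 1, w).expand(n, 1, h, w)
    # concatenate as one extra input channel
    return torch.cat([x, coord], dim=1)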

(You don’t say what model you’re using, but I think such a scheme could work
very well with a fully-convolutional semantic-segmentation model, such as U-Net.)

Best.

K. Frank

Thank you Frank, definitely worth a try.
I’m indeed using a U-Net, as you mentioned.

Hi Kfir!

Depending on the “field of view” of your U-Net and how far your left-side,
right-side vertical lines are from the edges of your images, your U-Net may
well not have any way of knowing that a vertical-line pixel is near either the
left side or the right side of the image. So adding the location information
as an additional channel might, indeed, be necessary.

(“Normalizing” the location to run from -1.0 on the left to 1.0 on the right,
with 0.0 in the middle, might make your training a little smoother when first
starting out.)

Best.

K. Frank

100% agree with you. I’ve done it with the following function:

import torch

def create_grad(shape):
    # shape: the 3-D spatial shape of one input volume, e.g. (D, H, W)
    # Generate linearly spaced values from -1 to 1 along the width (shape[2])
    width_values = torch.linspace(-1, 1, steps=shape[2])

    # Broadcast the width values to a full volume of the given shape
    image = width_values.unsqueeze(0).expand(*shape).contiguous()

    # Add a channel dimension so the result can be concatenated to the input
    return image.unsqueeze(0)

BTW, I’m using a 3D U-Net, so I don’t know how well this would work for 2D. In my case I wanted the gradient to go along the 3rd dimension, and that’s what this does. The function returns the volume with the wanted “gradient” plus an extra channel dimension, so it’s easy to concatenate to the input during the forward pass.
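
For reference, the intended usage during the forward pass would look roughly like this (the exact sizes below are made up just for illustration):

import torch

# hypothetical sizes, just for illustration
x = torch.randn(2, 1, 16, 64, 64)            # input batch: (N, C, D, H, W)
grad = create_grad(x.shape[2:])              # (1, D, H, W), gradient along W

# match batch size, device, and dtype, then append it as an extra channel
coord = grad.to(x.device, x.dtype).unsqueeze(0).expand(x.size(0), -1, -1, -1, -1)
x = torch.cat([x, coord], dim=1)             # (N, C + 1, D, H, W)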