Creating a simple Semantic Segmentation Network

Hi everyone!
I’m pretty new to PyTorch and interested in semantic segmentation.
I know there are several pretrained models included in PyTorch, but I would like to build one from scratch to really understand what is going on.

What I want is a binary pixelwise prediction that tells me, for each pixel, whether that pixel belongs to a car/human/whatever or is background.

I’ve read the “common” papers like U-Net, FCN, etc., but they usually have more complex skip connections and so on. What I would like to try is a very simple network that receives a [1, 3, 28, 28] RGB picture and returns a binary pixelwise prediction of the same size. I don’t care if the performance is good or not at the moment.

My main problem is how to upsample after sending the image through several convolution layers. Obviously the shape is changed due to pooling and the stride size. How do I make sure that the resulting output has the same dimensions as the input image? I’ve read that I can use nn.ConvTranspose2d() to upsample, but I have no idea what values the parameters should take in order to end up with a pixelwise prediction.

Can someone help me out?

Based on this description, it seems you would like to work on a multi-class segmentation rather than a binary segmentation, as you are dealing with multiple classes.
Alternatively, if each pixel might belong to multiple classes, you would work on a multi-label segmentation.

The parameters for upsampling depend on the number of layers you would like to use.
You could “revert” the conv/pooling operation or use another setup with more or less transposed convs.
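To make the “revert the conv/pooling operation” idea concrete, here is a minimal sketch (my own toy example, not the linked UNet code): a 3x3 conv with padding=1 keeps the spatial size, a 2x2 max pool halves it, and a transposed conv with kernel_size=2, stride=2 brings it back, so a [1, 3, 28, 28] input yields per-pixel class logits of the same spatial size. The layer sizes and class count are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Toy encoder-decoder: one downsampling step, one upsampling step."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # 28x28 -> 28x28
        self.pool = nn.MaxPool2d(2)                             # 28x28 -> 14x14
        # kernel_size=2, stride=2 "reverts" the 2x2 pooling: 14x14 -> 28x28
        self.up = nn.ConvTranspose2d(16, 16, kernel_size=2, stride=2)
        self.out = nn.Conv2d(16, num_classes, kernel_size=1)    # per-pixel logits

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x)
        x = torch.relu(self.up(x))
        return self.out(x)

x = torch.randn(1, 3, 28, 28)
model = TinySegNet()
print(model(x).shape)  # torch.Size([1, 2, 28, 28])
```

Each extra pool/transposed-conv pair would follow the same pattern, halving and then doubling the spatial size.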

I’ve written quite a simple UNet implementation a while ago, and you can find it here.
Maybe it’s good as starter code, or you could have a look at the setup of the transposed conv layers.


Thank you so much! Very easy to understand implementation!

However, I do have 2 questions:

  1. Your x (line 112) has shape (1, 3, 96, 96) (I assume (batch_size, nb_channels, height, width)), your y (line 113) has shape (1, 96, 96), and your model output has shape (batch_size, nb_classes, 96, 96).
    –> How is your loss function calculated? Shouldn’t y and output have the same dimensions? Or is that something special NLLLoss does?

  2. How did you know that your model output will have the same height and width as your input image (,,96,96)?
    I mean, all the parameters (stride, padding, etc.) directly influence the output shape. How did you know that the parameters you chose (e.g. when instantiating the model (lines 115 - 120) or your upconv layer (line 53)) would lead to the same output shape? Is it just trial and error, or is there some trick? :smiley:

Thanks again for sharing the code with me!

nn.CrossEntropyLoss (and thus also nn.NLLLoss) expect the model output to have the shape [batch_size, nb_classes, height, width] and the target [batch_size, height, width] containing the class indices for a segmentation use case.
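A quick shape check illustrates this convention (the sizes here mirror the (1, ?, 96, 96) shapes from the questions above, with a hypothetical nb_classes=2): the output carries one logit per class per pixel, while the target carries a single class index per pixel, and the two losses agree once log-probabilities are passed to nn.NLLLoss.

```python
import torch
import torch.nn as nn

batch_size, nb_classes, h, w = 1, 2, 96, 96
output = torch.randn(batch_size, nb_classes, h, w)         # [N, C, H, W] logits
target = torch.randint(0, nb_classes, (batch_size, h, w))  # [N, H, W] class indices

# nn.CrossEntropyLoss applies log_softmax internally and takes raw logits
criterion = nn.CrossEntropyLoss()
loss = criterion(output, target)

# nn.NLLLoss expects log-probabilities instead, hence the explicit log_softmax
log_probs = torch.log_softmax(output, dim=1)
loss2 = nn.NLLLoss()(log_probs, target)

print(torch.allclose(loss, loss2))  # True
```

So y and the model output are not supposed to have the same number of dimensions; the class dimension of the output is reduced against the index target.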

For even input shapes and “simple” convolutions, you can easily create the transposed conv so that you’ll get the same output shape.
However, if you are using odd input shapes, dilations, etc. you might need to use output_padding in your transposed convs.
Usually, I add print statements after each layer and make sure I get the shape I need.
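As a small illustration of the odd-shape case (my own example, with arbitrary channel counts): a 2x2 max pool on a 25x25 input floors down to 12x12, so the matching transposed conv lands one pixel short, and output_padding=1 restores the missing pixel.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 25, 25)        # odd spatial size
pool = nn.MaxPool2d(2)
y = pool(x)
print(y.shape)                       # torch.Size([1, 8, 12, 12]) -- floor drops a pixel

up = nn.ConvTranspose2d(8, 8, kernel_size=2, stride=2)
print(up(y).shape)                   # torch.Size([1, 8, 24, 24]) -- one pixel short

up_fixed = nn.ConvTranspose2d(8, 8, kernel_size=2, stride=2, output_padding=1)
print(up_fixed(y).shape)             # torch.Size([1, 8, 25, 25]) -- matches the input
```

Whether output_padding is needed depends on the exact kernel/stride/padding combination, which is why the print-statement check after each layer is a reliable habit.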
