Semantic Segmentation Training

mohitsharma916 · November 8, 2017, 10:09pm

I am trying to do Semantic Segmentation in PyTorch.

Before passing the image through the segmentation network, I downsample it to 321x321. (I also apply the same downsampling to the ground truth segmentation mask to retain the pixel-level correspondence.)

Once I pass this (321x321) size image through the segmentation network, I get a 41x41xC sized per-class prediction map, where C is the number of classes.

To calculate the loss, I have to either upsample the prediction map from 41x41xC to 321x321xC or downsample the ground truth segmentation mask to 41x41. Which one should I pick?

Also, if I have to upsample the prediction map, what is the right way to do fractional upsampling (321/41 is not an integer)?

SimonW · November 8, 2017, 10:18pm

That depends on what your goal is. If your goal is a low-res semantic map, then downsample the target. If you want a high-res semantic map, upsample the output (I personally would use ConvTranspose to upsample).
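
For the "downsample the target" option, a minimal sketch is shown here. It assumes the mask is stored as integer class indices and uses nearest-neighbour resizing so that no new label values are invented. The tensor names and the class count of 21 are placeholders, not values from the thread.

```python
import torch
import torch.nn.functional as F

# Shapes from the question: 321x321 masks, 41x41 predictions.
num_classes = 21  # assumed value, not given in the thread
mask = torch.randint(0, num_classes, (2, 321, 321))  # (N, H, W) class indices

# Nearest-neighbour resizing keeps every value a valid class index;
# bilinear resizing would blend neighbouring labels into fractions.
# interpolate() wants a channel dimension and a float tensor, hence the casts.
small = F.interpolate(mask[:, None].float(), size=(41, 41), mode="nearest")
small_mask = small.squeeze(1).long()  # back to (N, 41, 41) integer labels

print(small_mask.shape)
```

The resized mask can then be compared directly with the 41x41 prediction map.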

mohitsharma916 · November 9, 2017, 1:42am

Thanks for the input.

I read about transposed convolution and, from what I understood, the parameters (stride, filter size, padding, etc.) of the ConvTranspose operation that goes from 41x41 to 321x321 should be the same as the parameters of some convolution that goes from 321x321 to 41x41.

Is this right?

SimonW · November 9, 2017, 1:43am

You are right

mohitsharma916 · November 9, 2017, 1:58am

The simplest thing I can do is use a 1x1 kernel with a stride of 8. Should I expect some tradeoff in performance by choosing different kernel sizes and their corresponding strides?
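
The size arithmetic behind this can be checked with the standard output-size formulas for convolution and transposed convolution (dilation 1). The helper names below are made up for illustration, and the 16x16 kernel with padding 4 is only an example of an alternative setting.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a convolution along one spatial axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding=0, output_padding=0):
    """Output length of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# A 1x1 kernel with stride 8 maps 321 -> 41, and its transpose maps 41 -> 321.
print(conv_out(321, kernel=1, stride=8))            # 41
print(conv_transpose_out(41, kernel=1, stride=8))   # 321

# A larger kernel (illustrative choice) overshoots and would need cropping.
print(conv_transpose_out(41, kernel=16, stride=8, padding=4))  # 328
```

So the 1x1, stride-8 setting does reproduce the 321x321 size exactly, while other kernel sizes may land on a nearby size instead.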

herleeyandi · December 16, 2017, 10:46am

Hello, I am still new to semantic segmentation in PyTorch. Suppose I have a binary segmentation problem (only 2 classes), with tensors in the order [batch, channel, width, height] and a batch size of 8.

 My network output = [8, 2, 96, 160] -> 2 channels because there are 2 classes
 My ground truth (mask) = [8, 1, 96, 160] -> consists of 0 and 1 as the class labels

How can I measure the loss? Is there any way to use a CrossEntropy2D loss?

mohitsharma916 · December 17, 2017, 12:53am

You can use NLLLoss2d.

However, you might need to squeeze your ground truth mask along dimension 1 (the channel dimension). NLLLoss2d expects masks of shape (N, H, W).
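
A minimal sketch of the shape handling, using the sizes from the question with random stand-in data. In recent PyTorch releases `nn.NLLLoss` itself accepts (N, C, H, W) inputs, so the sketch uses it in place of `NLLLoss2d`; that substitution is an assumption about the reader's PyTorch version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
output = torch.randn(8, 2, 96, 160)          # network output: (N, C, H, W)
mask = torch.randint(0, 2, (8, 1, 96, 160))  # ground truth: (N, 1, H, W)

# Drop the singleton channel dimension and make sure labels are integers.
target = mask.squeeze(1).long()              # (N, H, W)

log_probs = F.log_softmax(output, dim=1)     # NLLLoss expects log-probabilities
loss = nn.NLLLoss()(log_probs, target)       # scalar, averaged over all pixels

print(target.shape, loss.item())
```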

herleeyandi · December 17, 2017, 2:26am

@mohitsharma916 Thank you. I have tried that already, and the result is still bad. Do you think we are missing something? I add log_softmax after NLLLoss2d; do you think that is necessary? My ground truth is in the range 0-1 (binary classification) and my image is in the range 0-255.

mohitsharma916 · December 17, 2017, 2:47am

NLLLoss2d expects the log probabilities as one of the inputs. Make sure you convert your outputs to log probabilities using LogSoftmax.
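
To illustrate the ordering, the sketch below applies log-softmax over the class dimension before the NLL loss and compares that with `F.cross_entropy`, which combines both steps. The data is random and the variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 2, 96, 160)          # raw network scores, (N, C, H, W)
target = torch.randint(0, 2, (8, 96, 160))   # class indices, (N, H, W)

# Log-softmax first (over the class dimension), then the NLL loss.
correct = F.nll_loss(F.log_softmax(logits, dim=1), target)

# cross_entropy applies log-softmax internally, so it takes raw scores.
shortcut = F.cross_entropy(logits, target)

# Feeding raw scores straight into nll_loss runs without error,
# but the number it returns is not a cross-entropy value.
wrong = F.nll_loss(logits, target)

print(correct.item(), shortcut.item(), wrong.item())
```

The first two values should agree, while the third differs, which is one way to check whether the log-softmax step is in the right place.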

SimonW · December 17, 2017, 4:25am

A 1x1 kernel won’t be good. You should choose a kernel size divisible by the stride to avoid artifacts.
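
One way to see the problem with a 1x1 kernel at stride 8 is to count how many output pixels it actually writes to. In this sketch the weight is set to 1 and the bias is disabled purely to make the pattern visible; those settings are illustrative and not from the thread.

```python
import torch
import torch.nn as nn

# 1x1 transposed convolution with stride 8, single channel for clarity.
up = nn.ConvTranspose2d(1, 1, kernel_size=1, stride=8, bias=False)

with torch.no_grad():
    up.weight.fill_(1.0)              # fixed weight so the output is easy to read
    y = up(torch.ones(1, 1, 41, 41))  # all-ones 41x41 input

filled = int((y != 0).sum())          # output pixels that received a value
total = y[0, 0].numel()               # all output pixels

print(tuple(y.shape[-2:]), filled, total)
```

Each input pixel lands on exactly one output pixel, so only 41 * 41 of the 321 * 321 positions are non-zero and the rest of the map carries no information from the input.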

mohitsharma916 · December 17, 2017, 4:42am

Thanks for the suggestion.

However, I ended up using simple bilinear upsampling (with the necessary cropping to match the input image size). To my surprise, it actually produced good results. In contrast, using transposed convolution made training unstable (maybe I need to do better parameter tuning to make it work).
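
A minimal sketch of what such bilinear upsampling with cropping could look like, assuming an integer scale factor of 8 followed by a top-left crop. The post does not say which crop or which `align_corners` setting was used, so both are assumptions, as is the class count of 21.

```python
import torch
import torch.nn.functional as F

scores = torch.randn(2, 21, 41, 41)   # (N, C, 41, 41); C=21 is an assumed value

# Upsample by an integer factor: 41 * 8 = 328, slightly larger than 321.
up = F.interpolate(scores, scale_factor=8, mode="bilinear", align_corners=False)

# Crop back to the input size. A top-left crop is assumed here;
# a centre crop would be an equally plausible reading of the post.
full = up[:, :, :321, :321]

print(up.shape, full.shape)
```

Passing `size=(321, 321)` to `F.interpolate` instead would reach the target size in one step without cropping.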