Understanding channels in binary segmentation

Hi, I am quite new to PyTorch and am having trouble understanding channels.

I am doing binary segmentation with DeepLab. My input images have the shape [N, 3, H, W] and my input masks have the shape [N, 1, H, W] (where the values are either 0 or 1). Before computing any accuracy or loss, the model output has the shape [N, 2, W, H] and the mask has the shape [N, 1, W, H]. If I understand correctly, I should one-hot encode the mask to match the output shape, i.e. [N, 2, W, H].

The problem is that the prediction I derive from the model output is [N, W, H] because of:
preds = torch.argmax(outputs['out'], 1)
I do this to get discrete values of either 0 or 1 instead of a continuous range.

My question is:
When calculating the loss, should I pass the output or preds along with the one-hot encoded mask? The same question applies to accuracy: how do I compare my preds of size [N, W, H] with a mask of size [N, 2, W, H]?
I hope this makes sense.

It depends a bit on how you would like to implement the binary segmentation.

For the usual use case, you would define a single output channel so that your output would have the shape [batch_size, 1, height, width], while the target would have the same shape and contain values in the range [0, 1].
If your model is returning logits (no activation at the end of your model), you could use nn.BCEWithLogitsLoss to calculate the loss.
To do so, you would directly pass the model output as well as the targets to this criterion.
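
As a minimal sketch of this single-channel setup, the snippet below uses random tensors as stand-ins for the model output and the mask (the shapes and the seed are assumptions chosen for illustration, not values from the original post). Note that the target has to be a floating point tensor for nn.BCEWithLogitsLoss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes; random logits stand in for the real model output.
batch_size, height, width = 4, 8, 8
logits = torch.randn(batch_size, 1, height, width, requires_grad=True)

# Binary mask with the same shape as the output, stored as float.
target = torch.randint(0, 2, (batch_size, 1, height, width)).float()

# The raw logits go straight into the criterion; no sigmoid or argmax first.
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, target)
loss.backward()

print(loss.item())
```

With a torchvision DeepLab model, the tensor passed as `logits` would presumably be `outputs['out']`, as in the question.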

To calculate the accuracy, you could apply a threshold (the default would be 0.0 for logits) to get the predictions and compare them with the target, provided your target contains only ones and zeros.
However, if your target also contains values between 0 and 1, I’m not sure what the accuracy calculation would look like. You could probably apply a threshold of 0.5 to the target, but it depends on how you are interpreting the target for this use case.
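
The thresholding described above could look roughly like this. `binary_pixel_accuracy` is a hypothetical helper name introduced only for this example, and the tiny hand-written tensors are made up so the result is easy to check by hand.

```python
import torch

def binary_pixel_accuracy(logits, target, threshold=0.0):
    """Fraction of pixels where the thresholded logit matches a 0/1 target."""
    preds = (logits > threshold).to(target.dtype)
    return (preds == target).float().mean().item()

# Shape [1, 1, 2, 2]: one image, one channel, four pixels.
logits = torch.tensor([[[[2.0, -1.0], [0.5, -3.0]]]])
target = torch.tensor([[[[1.0, 0.0], [0.0, 0.0]]]])

acc = binary_pixel_accuracy(logits, target)
print(acc)  # 3 of the 4 pixels match, so 0.75
```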

On the other hand, you could treat the binary segmentation as a multi-class segmentation use case with 2 classes.
For this approach your model would return output logits in the shape [batch_size, 2, height, width], the target would have the shape [batch_size, height, width] and contain the class indices [0, 1] (note the missing channel dimension).
nn.CrossEntropyLoss would be the criterion for this approach.
To calculate the accuracy, you would create the predictions via: preds = torch.argmax(output, 1) and compare it to the target.
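
A rough sketch of the two-class variant follows, again with random stand-in tensors and assumed sizes. It also shows that no one-hot encoding is needed: a [N, 1, H, W] mask only has to lose its channel dimension and be cast to integer class indices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes; random values stand in for the real model output.
batch_size, height, width = 4, 8, 8
output = torch.randn(batch_size, 2, height, width)

# Mask as loaded in the question: [N, 1, H, W] with values 0 or 1.
mask = torch.randint(0, 2, (batch_size, 1, height, width)).float()

# Drop the channel dimension and convert to class indices (int64).
target = mask.squeeze(1).long()  # [N, H, W]

criterion = nn.CrossEntropyLoss()
loss = criterion(output, target)  # the loss takes the raw output

# For the accuracy, argmax over the class dimension gives [N, H, W].
preds = torch.argmax(output, dim=1)
accuracy = (preds == target).float().mean().item()

print(loss.item(), accuracy)
```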

Let me know, if you need more information.


Hi ptrblck, thank you so much for your elaborate answer.

When I started, I declared the number of classes as 1 when initializing my DeepLab model. The issue was that the values in the output ranged between [-1.5, 1.5], and taking the argmax of this resulted in only zeros, nothing I could really use for my accuracy measurements.

Defining the model with 1 class, as I did at the start, seems to be the easier method as it matches the shape of the mask, but I’m not sure how I’ll find the threshold if my values are in a continuous range of [-1, 1.5].

If you don’t mind, could you give me some guidance on why my model output would even be negative or above 1?

Maybe the issue is in my custom dataset. At first I was just applying ToTensor(), which scales the values to [0, 1] (this still produces the [-1, 1.5] range in the model output). However, after applying transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), the input values started to fall outside [0, 1] even before training the model.

To make it easier to understand, here’s some data:

With ToTensor():

    [[0.8824, 0.8824, 0.9059,  ..., 0.8980, 0.8784, 0.8784],
     [0.8824, 0.9059, 0.8941,  ..., 0.8941, 0.8863, 0.8784],
     [0.8784, 0.8824, 0.8706,  ..., 0.8902, 0.8902, 0.8902],
     ...,
     [0.8941, 0.8941, 0.8980,  ..., 0.8941, 0.9098, 0.9216],
     [0.9059, 0.9020, 0.9059,  ..., 0.9098, 0.9255, 0.9255],
     [0.9059, 0.9059, 0.9059,  ..., 0.9216, 0.9255, 0.9255]],

    [[0.8902, 0.8902, 0.8941,  ..., 0.8588, 0.8627, 0.8627],
     [0.8902, 0.8784, 0.8824,  ..., 0.8549, 0.8588, 0.8627],
     [0.8706, 0.8706, 0.8784,  ..., 0.8510, 0.8588, 0.8667],
     ...,
     [0.9098, 0.8980, 0.8902,  ..., 0.8824, 0.8784, 0.8784],
     [0.8863, 0.8863, 0.8824,  ..., 0.8784, 0.8745, 0.8706],
     [0.8863, 0.8863, 0.8784,  ..., 0.8784, 0.8706, 0.8706]]])

With transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])]):

    [[2.0084, 2.0084, 2.0084,  ..., 1.9384, 1.9559, 1.9559],
     [2.0084, 1.9734, 1.9559,  ..., 1.9209, 1.9559, 1.9559],
     [1.9734, 1.9734, 1.8859,  ..., 1.9559, 2.0084, 2.0084],
     ...,
     [2.0259, 1.9734, 1.9909,  ..., 1.9909, 2.0434, 2.1134],
     [2.0084, 1.9909, 2.0259,  ..., 2.0434, 2.0959, 2.0959],
     [2.0084, 2.0084, 2.0259,  ..., 2.0434, 2.0959, 2.0959]],

    [[2.1868, 2.1868, 2.3088,  ..., 2.0125, 2.0474, 2.0474],
     [2.1868, 2.1694, 2.2217,  ..., 2.0474, 2.0648, 2.0474],
     [2.1520, 2.1520, 2.2043,  ..., 2.0823, 2.0997, 2.1346],
     ...,
     [2.1346, 2.1520, 2.1868,  ..., 2.1171, 2.1171, 2.1171],
     [2.1694, 2.1694, 2.2043,  ..., 2.1520, 2.1346, 2.1171],
     [2.1694, 2.1694, 2.2391,  ..., 2.1520, 2.1171, 2.1171]]])

I assume your last layer is a convolution layer with a single output channel.
In that case your model will return logits, which are raw prediction values in the range [-Inf, +Inf].
You could map them to a probability in the range [0, 1] by applying a sigmoid on these values.
In fact, nn.BCEWithLogitsLoss will apply sigmoid and log to the input. However, it does so in a more numerically stable way than applying these operations manually and separately.
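
To illustrate the relationship, the sketch below compares nn.BCEWithLogitsLoss on raw logits with nn.BCELoss on manually sigmoid-ed values. The random tensors and sizes are assumptions; for moderate logits like these, both should agree up to floating point error, while the fused criterion is the safer choice in general.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random stand-ins for a single-channel model output and a 0/1 mask.
logits = torch.randn(2, 1, 4, 4)
target = torch.randint(0, 2, (2, 1, 4, 4)).float()

# Fused version: sigmoid and log are handled inside the criterion.
fused = nn.BCEWithLogitsLoss()(logits, target)

# Manual version: explicit sigmoid followed by plain binary cross entropy.
manual = nn.BCELoss()(torch.sigmoid(logits), target)

print(fused.item(), manual.item())
```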

If we are talking about the probabilities in the range [0, 1] and you would like to use a threshold of 0.5 to determine if the prediction is class0 or class1, you could use a threshold of 0.0 on the logits and you’ll get the same predictions.
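
This equivalence follows from sigmoid(0) = 0.5 and the sigmoid being monotonically increasing. A small check, using made-up logits in roughly the [-1.5, 1.5] range mentioned in the question:

```python
import torch

# Hypothetical logits; none of them is exactly 0.
logits = torch.tensor([-1.5, -0.7, -0.2, 0.3, 0.9, 1.5])

# Thresholding the probabilities at 0.5 ...
preds_from_probs = torch.sigmoid(logits) > 0.5

# ... gives the same decisions as thresholding the logits at 0.0.
preds_from_logits = logits > 0.0

same = torch.equal(preds_from_probs, preds_from_logits)
print(same)  # True
```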

So you shouldn’t be worried about the range of the output.

As explained before, don’t use torch.argmax here, as it’ll return the index of the max value in dim1, which will always be 0 for a single output channel. torch.argmax is used for the multi-class approach, where each output channel corresponds to a class.
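
A quick way to see this, with a random tensor standing in for a single-channel output (shapes are assumed for illustration):

```python
import torch

torch.manual_seed(0)

# Single output channel: dim1 has size 1.
output = torch.randn(2, 1, 4, 4)

# With only one entry along dim1, the index of the maximum is always 0,
# regardless of the actual values.
preds = torch.argmax(output, dim=1)

print(preds.shape)         # torch.Size([2, 4, 4])
print(preds.sum().item())  # 0
```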


Thank you, the explanation of what logits are helps. So the range of my normalized inputs before training is nothing I should worry about either? Or should I keep just ToTensor()?

And indeed, the last layers of my model are:
(3): ReLU()
(4): Conv2d(256, 1, kernel_size=(1, 1), stride=(1, 1))

I would recommend using the normalization, as it’s often beneficial for training.
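
As for why the normalized inputs leave the [0, 1] range: the normalization subtracts a per-channel mean and divides by a per-channel std, so this is expected. The sketch below reproduces that arithmetic by hand with plain tensor operations instead of calling torchvision; the random image is an assumed stand-in for a ToTensor() result.

```python
import torch

torch.manual_seed(0)

# Per-channel statistics from the question, reshaped to broadcast over [3, H, W].
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

# Stand-in for a ToTensor() result: values in [0, 1).
image = torch.rand(3, 4, 4)

# Same arithmetic as the normalization: (x - mean) / std per channel.
normalized = (image - mean) / std

# Values that the extremes 0 and 1 map to in each channel.
lowest = (0.0 - mean) / std   # negative in every channel
highest = (1.0 - mean) / std  # above 1 in every channel

print(lowest.flatten(), highest.flatten())
```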

What is the difference between a class and a channel?
You described the multi-class segmentation use case with 2 classes as having the shape “[batch_size, 2, height, width]”.
But isn’t this 2 the number of channels?

Yes, dim1 can be seen as the channel dimension, which holds the number of classes, if it’s the output of a model for a multi-class segmentation.
The “meaning” of the dimensions is basically defined by the use case and nn.CrossEntropyLoss expects a model output in the shape [batch_size, nb_classes, *].
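
To make the distinction concrete, here is a toy sketch in which a single 1x1 convolution stands in for a segmentation model (it is not DeepLab, and all sizes are assumptions). On the input side dim1 holds the 3 color channels, while on the output side dim1 holds one score map per class.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

nb_classes = 2

# Toy stand-in for a segmentation model: 3 input channels, 2 output channels.
model = nn.Conv2d(in_channels=3, out_channels=nb_classes, kernel_size=1)

images = torch.randn(4, 3, 8, 8)  # dim1 = RGB channels
output = model(images)            # dim1 = class scores, shape [4, 2, 8, 8]

# Class indices per pixel, without a channel dimension.
target = torch.randint(0, nb_classes, (4, 8, 8))

loss = nn.CrossEntropyLoss()(output, target)
print(output.shape, loss.item())
```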