# Understanding channels in binary segmentation

Hi, I am quite new to PyTorch and have some difficulty understanding channels.

I am doing binary segmentation with DeepLab. My input images have shape [N, 3, H, W] and my masks have shape [N, 1, H, W] (where the values are either 0 or 1). The model output, before computing any accuracy or loss, has shape [N, 2, W, H] and the mask corresponds to [N, 1, W, H]. If I understand correctly, I should one-hot encode the mask to match the output channels, i.e. [N, 2, W, H].

The problem is that the prediction from my model output is [N, 1, W, H] because of:
`preds = torch.argmax(outputs['out'], 1)`
I do this to get values of either 0 or 1 instead of a continuous range.

My question is:
When calculating the loss, should I pass the output or the pred along with the one-hot encoded mask? The same question applies to accuracy: how do I compare my pred of size [N, W, H] with a mask of size [N, 2, W, H]?
I hope it makes sense.

It depends a bit on how you would like to implement the binary segmentation.

For the usual use case, you would define a single output channel so that your output would have the shape `[batch_size, 1, height, width]`, while the target would have the same shape and contain values in the range `[0, 1]`.
If your model is returning logits (no activation at the end of your model), you could use `nn.BCEWithLogitsLoss` to calculate the loss.
To do so, you would directly pass the model output as well as the targets to this criterion.
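A minimal sketch of this setup with random tensors (the shapes `N`, `H`, `W` are just illustrative, not taken from the thread):

```python
import torch
import torch.nn as nn

# Toy shapes for illustration
N, H, W = 4, 8, 8
logits = torch.randn(N, 1, H, W)                    # raw model output, no sigmoid at the end
target = torch.randint(0, 2, (N, 1, H, W)).float()  # binary mask containing 0s and 1s

criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, target)  # pass the logits directly, no activation needed
```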

To calculate the accuracy, you could apply a threshold (default would be `0.0` for logits) to get the predictions and compare it with the target, if your target contains only ones and zeros.
However, if your target also contains values between 0 and 1, I'm not sure what the accuracy calculation would look like. You could probably apply a threshold of `0.5` on the target, but it depends on how you are interpreting the target for this use case.
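The thresholding-based accuracy could be sketched like this (again with illustrative random tensors, assuming a target of only 0s and 1s):

```python
import torch

N, H, W = 4, 8, 8
logits = torch.randn(N, 1, H, W)                    # raw model output
target = torch.randint(0, 2, (N, 1, H, W)).float()  # mask containing only 0s and 1s

# Threshold the logits at 0.0, which matches a 0.5 threshold on sigmoid(logits)
preds = (logits > 0.0).float()
accuracy = (preds == target).float().mean()
```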

On the other hand, you could treat the binary segmentation as a multi-class segmentation use case with 2 classes.
For this approach your model would return output logits in the shape `[batch_size, 2, height, width]`, the target would have the shape `[batch_size, height, width]` and contain the class indices `[0, 1]` (note the missing channel dimension).
`nn.CrossEntropyLoss` would be the criterion for this approach.
To calculate the accuracy, you would create the predictions via: `preds = torch.argmax(output, 1)` and compare it to the target.
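A small sketch of the two-class approach with random tensors (shapes are illustrative):

```python
import torch
import torch.nn as nn

N, H, W = 4, 8, 8
logits = torch.randn(N, 2, H, W)         # 2 output channels, one per class
target = torch.randint(0, 2, (N, H, W))  # class indices 0/1, note: no channel dim

loss = nn.CrossEntropyLoss()(logits, target)

preds = torch.argmax(logits, 1)          # shape [N, H, W], matches the target
accuracy = (preds == target).float().mean()
```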


When I started, I declared the number of classes as 1 when initializing my DeepLab model. The issue was that the values in the output ranged between [-1.5, 1.5], and taking the argmax of this resulted in only zeros, nothing I could really use for my accuracy measurements.

Defining 1 class as I did at the start seems to be the easier method, as it matches the shape of the mask, but I'm not sure how I'll find the threshold if my values are in a continuous range of [-1, 1.5].

If you donāt mind, could you give me some guidance over why my model output would even be negative and above 1?

Maybe the issue is in my custom dataset. At first I was just applying `ToTensor()`, which normalizes values to [0, 1] (this still produces the [-1, 1.5] range from the model output). However, after applying `transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])`, the values started to go negative even before training the model.

To make it easier to understand, here's some data:

With `ToTensor()`:

```
[[0.8824, 0.8824, 0.9059,  ..., 0.8980, 0.8784, 0.8784],
[0.8824, 0.9059, 0.8941,  ..., 0.8941, 0.8863, 0.8784],
[0.8784, 0.8824, 0.8706,  ..., 0.8902, 0.8902, 0.8902],
...,
[0.8941, 0.8941, 0.8980,  ..., 0.8941, 0.9098, 0.9216],
[0.9059, 0.9020, 0.9059,  ..., 0.9098, 0.9255, 0.9255],
[0.9059, 0.9059, 0.9059,  ..., 0.9216, 0.9255, 0.9255]],

[[0.8902, 0.8902, 0.8941,  ..., 0.8588, 0.8627, 0.8627],
[0.8902, 0.8784, 0.8824,  ..., 0.8549, 0.8588, 0.8627],
[0.8706, 0.8706, 0.8784,  ..., 0.8510, 0.8588, 0.8667],
...,
[0.9098, 0.8980, 0.8902,  ..., 0.8824, 0.8784, 0.8784],
[0.8863, 0.8863, 0.8824,  ..., 0.8784, 0.8745, 0.8706],
[0.8863, 0.8863, 0.8784,  ..., 0.8784, 0.8706, 0.8706]]])
```

With `transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])`:

```
[[2.0084, 2.0084, 2.0084,  ..., 1.9384, 1.9559, 1.9559],
[2.0084, 1.9734, 1.9559,  ..., 1.9209, 1.9559, 1.9559],
[1.9734, 1.9734, 1.8859,  ..., 1.9559, 2.0084, 2.0084],
...,
[2.0259, 1.9734, 1.9909,  ..., 1.9909, 2.0434, 2.1134],
[2.0084, 1.9909, 2.0259,  ..., 2.0434, 2.0959, 2.0959],
[2.0084, 2.0084, 2.0259,  ..., 2.0434, 2.0959, 2.0959]],

[[2.1868, 2.1868, 2.3088,  ..., 2.0125, 2.0474, 2.0474],
[2.1868, 2.1694, 2.2217,  ..., 2.0474, 2.0648, 2.0474],
[2.1520, 2.1520, 2.2043,  ..., 2.0823, 2.0997, 2.1346],
...,
[2.1346, 2.1520, 2.1868,  ..., 2.1171, 2.1171, 2.1171],
[2.1694, 2.1694, 2.2043,  ..., 2.1520, 2.1346, 2.1171],
[2.1694, 2.1694, 2.2391,  ..., 2.1520, 2.1171, 2.1171]]])
```

I assume your last layer is a convolution layer with a single output channel.
In that case your model will return logits, which are raw prediction values in the range `(-Inf, +Inf)`.
You could map them to a probability in the range `[0, 1]` by applying a `sigmoid` on these values.
In fact, `nn.BCEWithLogitsLoss` applies `sigmoid` and `log` to the input internally, but in a more numerically stable way than applying these methods separately.

If we are talking about probabilities in the range `[0, 1]` and you would like to use a threshold of `0.5` to determine whether the prediction is class0 or class1, you can use a threshold of `0.0` on the logits and you'll get the same predictions, since `sigmoid(0) = 0.5`.
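This equivalence is easy to verify on a few hand-picked logit values:

```python
import torch

logits = torch.tensor([-1.5, -0.1, 0.0, 0.3, 1.5])
probs = torch.sigmoid(logits)

# sigmoid(0) == 0.5, so these two thresholds select exactly the same pixels
preds_from_probs = probs > 0.5
preds_from_logits = logits > 0.0
```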

So you shouldnāt be worried about the range of the output.

As explained before, donāt use `torch.argmax`, as itāll return the max index in dim1, which will always be 0. `torch.argmax` is used for the multi-class approach, where each output channel corresponds to a class.


Thank you, the explanation of what logits are helps. So my normalization values before training are nothing I should worry about either? Or should I keep just `ToTensor()`?

And indeed, the last layers of my model are:
(3): ReLU()
(4): Conv2d(256, 1, kernel_size=(1, 1), stride=(1, 1))

I would recommend using the normalization, as it's often beneficial for training.

What is the difference between a class and a channel?
You stated that for the multi-class segmentation use case with 2 classes, the output shape is `[batch_size, 2, height, width]`.
But isn't this 2 the number of channels?

Yes, `dim1` can be seen as the channel dimension, which holds the number of classes if it's the output of a model for multi-class segmentation.
The "meaning" of the dimensions is basically defined by the use case, and `nn.CrossEntropyLoss` expects a model output in the shape `[batch_size, nb_classes, *]`.