I am using a custom model based on SegNet to do segmentation for 3 different objects. The model returns a tensor of size “[5, 3, 120, 160] (batch size, channel, height, width)” and I have a label where each channel contains information about a particular object, something like this:
I am using the CrossEntropyLoss function to calculate the loss, but I am getting this error: “1only batches of spatial targets supported (non-empty 3D tensors) but got targets of size: : [5, 3, 120, 160]”.
As far as I can understand, I need to have a one-hot-encoded vector for every pixel to calculate the loss. In my case, the label does represent a one-hot encoding (for a given pixel, only one channel in the label will have the value ‘1’). May I know what can be done to resolve the issue?
I’m not sure that you’re describing what you are doing consistently.
However, here are some comments based on some of the things you said:
This is backwards. For CrossEntropyLoss you do not use a
one-hot-encoded label vector. Rather, you use a single integer
class label (per pixel, in your case).
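Since your existing labels are already one-hot along the channel dimension, you can recover integer class labels with an argmax over that dimension. As a sketch (the label data here is made up to match your [5, 3, 120, 160] shape):

```python
import torch

# hypothetical one-hot label tensor of shape [nBatch, nClass, height, width]
one_hot = torch.zeros(5, 3, 120, 160)
one_hot[:, 0] = 1.0  # pretend every pixel belongs to class 0

# collapse the class dimension: argmax over dim 1 gives one integer label per pixel
labels = one_hot.argmax(dim=1)  # shape [5, 120, 160], dtype int64
```

The resulting `labels` tensor is exactly what CrossEntropyLoss expects as its target.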
Let me say what I think you are saying in slightly different words:
You are training a model to perform 3-class segmentation
(classification). Your input is a 120x160-pixel image (or, more
precisely, a batch of nBatch = 5 such samples). Your output
is supposed to predict which of three classes each pixel is in.
This looks correct, but I would use slightly different words, and
say that your model returns a tensor of shape [nBatch, nClass, height, width].
That is, I understand that what you are calling channel = 3
is the number of classes into which you classify pixels.
This looks wrong for use with CrossEntropyLoss.
I’m assuming that you are giving a simplified version of your
label tensor, leaving out the nBatch dimension, with nClass = 3,
height = 4, and width = 4 (for a shape of [3, 4, 4]). This looks
like a one-hot-encoded version of your class labels, where the
labels are one-hot encoded along the nClass = 3 dimension of your
label tensor.
You should instead have a label tensor of integer class labels:
This is a tensor of shape [height, width] (again, leaving out
the batch dimension), which in this simplified example is [4, 4].
The number of classes, nClass = 3 is not part of the shape of the
label tensor, but shows up in the values of the labels – they range
over (0, 1, 2) (that is, [0, nClass - 1] inclusive).
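For concreteness, here is what such a simplified [4, 4] label tensor could look like (the particular values are made up for illustration):

```python
import torch

# integer class labels of shape [height, width] = [4, 4];
# every entry is one of the nClass = 3 values 0, 1, or 2
target = torch.tensor([
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [2, 2, 1, 1],
    [0, 0, 2, 2],
])  # dtype int64 by default for Python ints
```

Note that nClass never appears in the shape, only in the range of the values.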
CrossEntropyLoss takes (for the “multi-dimensional case”) a
prediction of shape [nBatch, nClass, height, width]
(what it calls input), whose entries are logits that run from -inf to +inf. The label has shape [nBatch, height, width]
(what it calls target), and its entries are integer class labels
that range over [0, nClass - 1], inclusive.
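Putting the shapes together, a minimal sketch with random stand-in data (using your nBatch = 5, nClass = 3, 120x160 dimensions):

```python
import torch
import torch.nn as nn

nBatch, nClass, height, width = 5, 3, 120, 160

# prediction ("input"): raw logits, shape [nBatch, nClass, height, width]
logits = torch.randn(nBatch, nClass, height, width)

# label ("target"): integer class labels, shape [nBatch, height, width],
# with values in [0, nClass - 1]
target = torch.randint(0, nClass, (nBatch, height, width))

loss = nn.CrossEntropyLoss()(logits, target)  # a scalar tensor
```

With a one-hot [5, 3, 120, 160] target this same call would raise the “spatial targets” error you quoted, because the target has one dimension too many.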
Sorry for not being clear with my question, and thanks a ton for explaining everything in depth. What you interpreted from my question is exactly what I wanted to convey. And yes, you answered my question: instead of using one-hot encoding I used integer class labels and it worked. I’ll make sure to be clearer with my questions in future posts.
Again, Thank you!!