Logit normalization and loss functions to perform instance segmentation

The goal is to perform instance segmentation with input RGB images and corresponding ground truth labels.

The ground truth label is multi-channel i.e. each class has a separate channel and there are different instances in each channel denoted by unique random numbers. Each label consists of 5 non-background classes (no background class info). The following figure will give a better insight into the labels (please note that the actual dimension is 128x128 not 7x7).

A label may have multiple instances of the same class objects.

Logit predictions were obtained for a batch of data [64, 5, 128, 128], i.e. [B, C, W, H], from a typical U-Net (in a forward pass) and then passed through a sigmoid layer to get normalized predictions in the interval [0,1]. Several problems came up right before the backward pass; some of them are:

  1. nn.BCELoss(): the reported loss for the batches was either too low, i.e. {0.0310, 0.0470, 0.1696,…}, or negative, i.e. {-0.2363, -0.0790, -0.1972,…}. I can’t think of a reason other than normalization of the logits, BUT the sigmoid layer had already normalized the target values to the interval [0,1]. (?)

  2. nn.CrossEntropyLoss(): this loss gave an error (as reported below). My understanding is that a single-channel target, i.e. [64, 1, 128, 128], is needed to use this loss function, but I haven’t tried this idea.

ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: 1only batches of spatial targets supported (3D tensors) but got targets of size: : [64, 5, 128, 128]
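The shape mismatch in point 2 can be reproduced on a small scale (tensor sizes shrunk here for brevity; the exact error message varies across PyTorch versions):

```python
import torch
import torch.nn as nn

# Stand-in for the reported shapes, shrunk from [64, 5, 128, 128] to
# [2, 5, 4, 4]; the target's shape, not the batch size, matters here.
pred = torch.randn(2, 5, 4, 4)                   # [B, C, H, W] logits

# A class-index target with the same 4D shape as the prediction
# triggers a RuntimeError (the message wording varies by version):
bad_target = torch.randint(0, 5, (2, 5, 4, 4))   # extra class dimension
try:
    nn.CrossEntropyLoss()(pred, bad_target)
except RuntimeError as e:
    print("error:", e)

# CrossEntropyLoss expects the target WITHOUT the class dimension:
good_target = torch.randint(0, 5, (2, 4, 4))     # [B, H, W] class indices
loss = nn.CrossEntropyLoss()(pred, good_target)
print("loss:", loss.item())
```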

I do understand that Dice, IOU, Jaccard,… are the popular loss functions for segmentation problems in addition to the cross-entropy based losses.

Would you please tell me where I made the mistakes above? Is it the formulation of the target labels?

Hi Stark!

It’s not entirely clear to me what you are doing.

But to answer two technical questions:

If I take at face value the images of numbers you posted in your image
of textual information
by assuming that these are your actual target
values, then they are outside the [0.0, 1.0] range required by
BCELoss (and BCEWithLogitsLoss). BCELoss requires its target
values to be probabilities between 0 and 1. Hence your unreasonable
loss values.
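This can be checked directly from the per-element BCE formula (a small illustration, using a made-up out-of-range target of 32 like the instance ids in the figure; note that newer PyTorch versions may reject such targets with an error rather than return a negative loss):

```python
import torch

# Per-element binary cross entropy: -(y*log(p) + (1 - y)*log(1 - p)).
# For y in [0, 1] this is always >= 0, but for an out-of-range target
# such as y = 32 the (1 - y) factor goes negative and the "loss" can
# drop below zero.
def bce(p, y):
    return -(y * torch.log(p) + (1 - y) * torch.log(1 - p))

p = torch.tensor(0.99)               # a sigmoid-normalized prediction
print(bce(p, torch.tensor(1.0)))     # small positive value
print(bce(p, torch.tensor(32.0)))    # large negative value
```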

Do you really mean that you have passed your target values through
Sigmoid? Or do you mean that your input to BCELoss (the output
of your model) has been passed through Sigmoid?
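For reference, passing the model output (not the target) through Sigmoid before BCELoss is equivalent to feeding the raw logits straight to BCEWithLogitsLoss, which is the numerically safer route (a small sketch with made-up shapes):

```python
import torch
import torch.nn as nn

logits = torch.randn(2, 5, 4, 4)                     # raw model output
target = torch.randint(0, 2, (2, 5, 4, 4)).float()   # valid 0/1 target

loss_a = nn.BCELoss()(torch.sigmoid(logits), target)  # explicit sigmoid
loss_b = nn.BCEWithLogitsLoss()(logits, target)       # fused, stabler

print(loss_a.item(), loss_b.item())  # the two agree to within rounding
```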

CrossEntropyLoss requires an input and target with different shapes,
where input has an nClass dimension, and target does not. For
example, if your input is of shape [nBatch, nClass, width, height],
your target should have shape [nBatch, nClass, width, height].

It appears that your input and target don’t satisfy this relationship. Hence
your error.

Now some comments:

Are you really applying your model to RGB images?

If you were using RGB images, I would expect your model to take a batch
of data with shape [64, nColor = 3, nChannel = 5, 128, 128] (or
something similar).

Are you somehow folding your RGB channels into your larger set of
five channels?

Does the number – 5 – of “ground truth” channels have anything to do
with the apparent number – also 5 – of input channels, or is this just a
coincidence?

What does this labelling scheme mean? The first image of a number in
your image of textual information is 0. The second (distinct) number is
32. What does 32 mean? Is it really random? Are you expecting your
network to learn the sequence of random numbers generated by your
random number generator?

Does a “32” in “ch #0” mean the same thing as a “32” in, say, “ch #2”?
Does a “32” in sample 3 of a batch mean the same thing as a “32” in,
say, sample 12 of that batch?

Good luck.

K. Frank

Thanks @KFrank for your time.

I want a U-Net to learn the ground truth labels to perform the instance segmentation task. In semantic segmentation, all objects of the same type are marked using one class label, while in instance segmentation similar objects get their own separate labels.

The problem is I cannot figure out a suitable encoding scheme that can work with the loss functions implemented in PyTorch.

The input to BCELoss is the model output after passing through sigmoid. Overall, I followed this sequence:
RGB images + 5D labels (input) —> model —> 5D logit output —> sigmoid() —> BCELoss()/CrossEntropyLoss()

The number of input channels to a model can be different from the number of output channels. Please correct me if I’m wrong. I’ve seen so many segmentation models designed this way while using the CrossEntropyLoss(). My best guess is some problem with the way target labels are encoded.

Yes, my model takes a batch of [3, 128, 128] images and [5, 128, 128] target labels as input.

I would like to read some relevant research work using such a formulation. It’s quite interesting that you used nColor and nChannel differently as opposed to the conventional wisdom that says nColor = nChannel. For instance, nColor = nChannel = 3 in an RGB image of size 3x128x128.

Can you please elaborate on this?

A single label is of size [5, 128, 128], where nChannel = 5 indicates nClass = 5, which means the foreground pixels in each input image may belong to any of the five classes.

Referring to the figure in the initial post, 0 means a background pixel and any non-zero value means a foreground pixel. Random numbers were assigned just to distinguish multiple foreground instances belonging to the same class (note: each class is a separate channel). By the way, assigning 1 to every foreground pixel in the target labels (as in the figure below) lets BCELoss() produce reasonable values. However, I’m not sure how multiple occurrences of foreground objects from the same class can be identified in the model output (since the model cannot infer this from the labels).
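The binarization described here is a one-liner; a small sketch with a made-up 3x3 label channel (note that it discards the instance ids, which is exactly the information-loss problem raised above):

```python
import torch

# One channel of an instance-id label: 0 = background, arbitrary
# positive ids (e.g. 32) mark separate instances of this class.
label = torch.tensor([[0., 32., 32.],
                      [0.,  0., 17.],
                      [5.,  0.,  0.]])

# Collapse ids to a binary foreground mask -- a valid BCELoss target,
# but the instance identities (32 vs 17 vs 5) are lost.
binary = (label > 0).float()
print(binary)
```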

Not at all, since the foreground objects have heterogeneous shapes and can be found anywhere in an image. It’s nothing more than a way to tell the model that there are different instances of class X.

Hi Stark!

Okay, now I think I understand your use case.

Unfortunately, I don’t think I have anything useful to say about “instance
segmentation.” Perhaps some experts will chime in.

I do stand by my technical comments about the immediate causes of the
errors you reported. Some clarifying comments appear inline, below.

I’m not aware of any built-in pytorch loss functions that are directly
applicable to instance segmentation, but, like I said, I’m not knowledgeable
about this use case.

When you say “5D labels (input)” the only thing I think you can be referring
to are the ground-truth labels (“target”) that you use for training. Do you
really feed in the ground-truth labels as input into your model?

Also, you can’t use the output of the same model as input to both
BCELoss and CrossEntropyLoss (if they are using the same target).
Wrong shapes.

As I mentioned in my previous post in the context of BCELoss, I am using
input to refer to what pytorch calls the input to your loss function. This
is the output of your model, not the input to your model. (This bit of pytorch
terminology can lead to confusion.)

Note, in the bit you quoted, I gave the wrong shape for the target fed to
CrossEntropyLoss: It should be [nBatch, width, height], without an
nClass dimension.

Again, does your model really take target labels as input? Or are you
referring to your overall training procedure?

Leaving this aside, if I understand “target labels” to be the target
you feed to BCELoss, then your shapes do make sense. To use my
preferred terminology, the input to your model is a tensor of shape
[nBatch, nChannel = nColor = 3, 128, 128]. (And, yes, it is
common to refer to the three colors as “channels.”)

Your model then outputs predictions (the input to your loss function) of
shape [nBatch, nClass = 5, 128, 128]. This would be appropriate for
BCELoss and (something analogous to) a multi-label, multi-class problem.

The target you feed your loss function should, for BCELoss, have the
same shape ([nBatch, 5, 128, 128]) (which it appears it does), while,
for CrossEntropyLoss, it should have a shape lacking the nClass
dimension ([nBatch, 128, 128]).
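Concretely, the two target conventions look like this (shapes shrunk for brevity; BCEWithLogitsLoss is used so no explicit Sigmoid is needed):

```python
import torch
import torch.nn as nn

pred = torch.randn(2, 5, 4, 4)   # [nBatch, nClass, width, height] logits

# Multi-label (BCE-style): target has the SAME shape as pred,
# with values in [0, 1].
bce_target = torch.randint(0, 2, (2, 5, 4, 4)).float()
bce_loss = nn.BCEWithLogitsLoss()(pred, bce_target)

# Single-label (CrossEntropyLoss): target LACKS the nClass dimension
# and holds integer class indices in [0, nClass).
ce_target = torch.randint(0, 5, (2, 4, 4))
ce_loss = nn.CrossEntropyLoss()(pred, ce_target)

print(bce_loss.item(), ce_loss.item())
```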

Yes. By doing so you are (properly) training a conventional multi-label,
multi-class classifier.

But, also yes, doing so won’t perform instance segmentation for you
(which, again, I don’t know how to do). That is, I agree with your analysis
that, while setting all of the foreground pixel values to 1 does fix your
problem of bad BCELoss values, doing so throws away the information
you need to perform instance segmentation.


K. Frank