Same output when using Multilabel BCE

Hi all,

I am working on a segmentation project where the inputs are CT images and the output consists of segmentations with a total of 3 classes.
My first model was trained by transforming the segmentation map into a one-hot encoded vector and applying a softmax after the final layer (so actually 4 class labels when counting the background).
That works quite OK, but I have noticed that labels can overlap in certain cases.

My idea was therefore to stack the 3 different segmentation maps into 3 separate image channels, where each segmentation is binary (background = 0, foreground = 1).

To train the model I used the BCEWithLogitsLoss loss, and during training the result looks good.

Nevertheless, when using the model to predict new segmentations for a given CT image, the model outputs the same segmentation in all three channels and I can’t figure out what’s wrong…

(Just to be sure that I am not providing wrong target segmentations, I have already checked those…)

Any ideas or hints would be awesome!

Thanks in advance,



Hi Michael!

What you describe is a standard approach. You should be able to
make it work, assuming you have enough good training data, and
your network is appropriate for the task.

I have some comments in line, below.

Because you can transform your segmentation map into one-hot
encoding, it sounds like you are working on a single-label, multi-class
classification (segmentation) problem.

Applying softmax() also suggests that you have a single-label
problem. (In such a case, however, you would use CrossEntropyLoss
without the softmax(), as CrossEntropyLoss has, in effect, the
softmax() built in. You would also not use one-hot labels with
CrossEntropyLoss.)

“Overlapping” labels sounds like a multi-label, multi-class problem.

I assume that you mean you sometimes have overlapping target
labels – your ground-truth annotations that you use to train – rather
than overlapping predicted labels.

But if you do have overlapping labels, how can you convert them
to one-hot encoding?

This is appropriate for multi-label image segmentation.

This is also appropriate. But, just to be sure, you should be feeding
the output of your final Linear layer as your logit predictions to
your BCEWithLogitsLoss loss function, without passing it through
softmax() or sigmoid().
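As a minimal sketch of this point (shapes here are hypothetical, just for illustration), the raw output of the final layer goes straight into BCEWithLogitsLoss:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: batch of 2, 3 foreground classes, 4x4 images.
logits = torch.randn(2, 3, 4, 4)                     # raw output of the final layer
targets = torch.randint(0, 2, (2, 3, 4, 4)).float()  # multi-label binary masks

loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)  # sigmoid() is applied internally -- don't apply it yourself
```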

I assume that this means that your training-set loss goes down and
your training-set accuracy goes up in a sensible way.

This, of course, is not what you want. Also, this should show up
quantitatively in that your test-set loss and accuracy should be
much worse than that for your training set.

One thing to check for: Do you have an unbalanced dataset where
some classes (including your background class) appear much more
frequently than others? If so, consider using BCEWithLogitsLoss's
pos_weight argument to help account for this.
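One common way to set pos_weight (a sketch, assuming hypothetical shapes and that you weight each class by its negative-to-positive pixel ratio):

```python
import torch
import torch.nn as nn

# Hypothetical multi-label targets: [nBatch, nClass, H, W]
targets = torch.randint(0, 2, (8, 3, 16, 16)).float()

# pos_weight per class: (# negative pixels) / (# positive pixels)
n_pos = targets.sum(dim=(0, 2, 3))                   # positive pixels per class
n_total = targets.numel() / targets.shape[1]         # total pixels per class
pos_weight = (n_total - n_pos) / n_pos.clamp(min=1)  # clamp avoids division by zero

# pos_weight must broadcast against [nBatch, nClass, H, W], hence the reshape
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight.view(1, -1, 1, 1))
loss = loss_fn(torch.randn(8, 3, 16, 16), targets)
```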

Good luck.

K. Frank


Hi and thanks for your great and detailed answer!

Here I need to clarify some things about my data:
My dataset consists of one large bone structure. This bone structure has been segmented as a whole but it has also been split and segmented as three single structures.

I actually use a combination of CE and Soft-dice which seems to work quite well.

Good question! Some further clarification:
The network that uses one hot encoded labels was only trained in order to predict the three single structures.
What I noticed is that the network struggles somewhat where one structure ends and the next begins. This is just because they basically belong to the same bone structure. Nevertheless, I want my model to be able to differentiate between them.

In addition I want my model to be able to predict the three single structures as well as the whole structure.
This is what I meant by ‘overlapping (target) labels’ and where I had the idea of stacking the segmentations into different image channels.

That is probably the case as the whole structure has much more foreground pixels than the single structures on their own.

I made a quick check and removed the channel with the whole structure and that seems to work.

What I don’t understand though is why the predictions in the different channels become the same when I include the segmentation of the whole structure.
I would have expected that the channel-wise Dice coefficient (which is used as an evaluation metric) would be good for 1 channel but would get worse for the other three as the whole segmentation takes over the loss function…

Best regards,


One question that came to my mind:

Do the images have to be one-hot encoded for BCEWithLogitsLoss?
Right now I am just stacking the different segmentation masks into 4 different channels:
ch1: whole structure
ch2: sub_structure 1
ch3: sub_structure 2
ch4: sub-structure 3

All images are binary (0 background, 1 foreground).



Hi Michael!

Does this mean that:

  1. The large structure is – either in the real world and/or in your
    target data – exactly the union of the three substructures?

  2. The three substructures have exactly zero intersection with
    one another?

If so, would it suffice to simply segment the three substructures, with
the large structure being determined by post-processing that takes
the union of the substructures, rather than directly by the model itself?
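The post-processing step would be a one-liner; a sketch, assuming the model emits an integer class-index map with 0 for background:

```python
import torch

# Hypothetical predicted class-index map from the 4-class model
# (0 = background, 1..3 = substructures), shape [H, W]
pred = torch.tensor([[0, 1, 1],
                     [2, 2, 0],
                     [3, 3, 0]])

# The large structure recovered as the union of the three substructures
large_structure = (pred > 0).long()
```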

You could argue that getting the large structure even more correct
is more important than having the large structure fully consistent
with the substructures, and that having the model segment the
large structure (in addition to the substructures) lets you do this.
But there is something to be said for at least trying the simpler
approach first.

I’ll mention CrossEntropyLoss again, below, but for now: It’s perfectly
reasonable to add Dice loss to CrossEntropyLoss, but my general
advice would be to start with CrossEntropyLoss as the default, and
only add Dice loss if it gives you a clear improvement.

If you don’t need to explicitly predict the large structure (because it
can be adequately recovered as the union of the three substructures),
you will be performing a single-label, four-class (background,
substructure-1, substructure-2, substructure-3) segmentation problem.
For this you will want CrossEntropyLoss (adding Dice loss, if that
clearly helps).

Speaking in terms of a 2d, grayscale image, the input to your model
should have shape [nBatch, width, height], and the output
(which will become the input to CrossEntropyLoss) should have
shape [nBatch, nClass = 4, width, height]. From the masks
you have, you should build a single mask (per image) that consists
of integer class labels (with values 0 through 3) and has shape
[nBatch, width, height]. (Note that there is no nClass dimension.)
The integer-class-label mask will be the target for CrossEntropyLoss.
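Building that integer-class-label mask from the binary substructure masks might look like this (a sketch with tiny hypothetical 2x2 masks, assuming the substructures do not overlap one another):

```python
import torch

# Hypothetical non-overlapping binary masks for the three substructures
sub1 = torch.tensor([[1, 0], [0, 0]])
sub2 = torch.tensor([[0, 1], [0, 0]])
sub3 = torch.tensor([[0, 0], [1, 0]])

# Stack into [3, H, W] and assign class labels 1..3; background stays 0
stacked = torch.stack([sub1, sub2, sub3])     # [3, H, W]
class_ids = torch.arange(1, 4).view(3, 1, 1)  # labels 1, 2, 3
target = (stacked * class_ids).sum(dim=0)     # [H, W], integer values 0..3
```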

If you want to also explicitly predict your large structure, either because
it is different from the union of the substructures, or because predicting
it separately is (usefully) more accurate than taking the union, then I
would recommend performing the two predictions in a single network
as follows:

Understand the large-structure prediction as a single-label, binary
segmentation problem, and the substructure predictions as a
single-label, four-class segmentation problem (as described above).

Make your last Linear layer have five outputs, so that the model
output has shape [nBatch, 5, width, height]. Understand the
first of your five outputs to be your large-structure prediction, and
the remaining four to be your substructure predictions. Make a binary
(0-1) mask for your large structure, and make a second four-class
(as above) mask for your substructures. Peel off the first of your
five outputs and feed it into BCEWithLogitsLoss together with
your large-structure mask, and peel off the last four outputs, feeding
them into CrossEntropyLoss together with the class-label mask.
Add the two losses together (probably with some relative weight
that you tune by hand) to get a combined loss (that you call
loss.backward() on).
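The peel-off-and-combine step above can be sketched as follows (hypothetical shapes; the relative weight is a made-up placeholder you would tune):

```python
import torch
import torch.nn as nn

nBatch, H, W = 2, 8, 8
output = torch.randn(nBatch, 5, H, W)  # five channels from the final layer

large_mask = torch.randint(0, 2, (nBatch, H, W)).float()  # binary large-structure target
class_mask = torch.randint(0, 4, (nBatch, H, W))          # integer labels 0..3

# Peel off channel 0 for the binary task, channels 1..4 for the four-class task
loss_large = nn.BCEWithLogitsLoss()(output[:, 0], large_mask)
loss_sub = nn.CrossEntropyLoss()(output[:, 1:], class_mask)

relative_weight = 1.0  # hypothetical value -- tune by hand
loss = loss_large + relative_weight * loss_sub
# loss.backward()  # in a real training loop
```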

Your large structure and substructures are closely related, so it
makes sense to train a single model jointly on these two segmentation
tasks. Most of the upstream processing and features are shared by
the two tasks, and only the final Linear layer “learns” how to predict
the large structure and the substructures separately from the upstream
features.

Based on my assumptions about your use case, this approach seems
to do what you most naturally want. You have both a single-label,
binary segmentation task, and a single-label, four-class segmentation
task, and this approach uses the (generally) most applicable methods
to train these two tasks.

Lastly, to clarify:

You would typically use BCEWithLogitsLoss for a multi-label,
multi-class problem, in which case your target image (mask) should
be, if you will, multi-hot encoded. That is, there can be a 1 in the
position for any number of your classes, including none and all of
them. (In this case, you would not have a background class – the
background “class” would be indicated by all 0's, that is, 0's for all
of your foreground classes.)

If you can, in fact, one-hot encode your target – exactly one of your
classes (including the background class) is active at a time – then
you have a single-label, multi-class problem, and, in general,
CrossEntropyLoss will be the better choice (because, in essence,
your network “knows” that it’s being trained for a single-label task).

Good luck.

K. Frank


Hi and thanks again for a more than helpful answer!

Yes, that would be an option. Nevertheless, there is a slight difference between the large structure and the single structures.

Thanks! I will make a comparison…

That’s a great idea!

Just to be sure:

If I use BCEWithLogitsLoss, I do not have to use a one-hot encoded label, which in the case of the large structure would give me a 2-channel image (ch 0: background, ch 1: foreground). In that case a simple binary image with 1 channel would be enough? (using a sigmoid activation function)

Otherwise, I could one-hot encode my image and use CrossEntropyLoss instead? (using a softmax function)



Hi Michael!

First, to clear up some terminology:

An integer class label is a single integer that specifies the class.
These values run from 0 to nClass - 1, inclusive. Pytorch’s
CrossEntropyLoss takes integer class labels.

One-hot encoding encodes exactly the same information as an
integer class label. (You can convert back and forth between the
two without loss of information.) A one-hot-encoded value is a
vector of nClass 0's and 1's, where there is exactly one 1 and
nClass - 1 0's. The index of the location of the 1 is equal to the
integer class label.

CrossEntropyLoss does not take one-hot-encoded targets, but
instead takes integer class labels, which, to reiterate, contain exactly
the same information.
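The lossless round trip between the two encodings can be seen in a couple of lines:

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([2, 0, 1, 3])         # integer class labels, nClass = 4
one_hot = F.one_hot(labels, num_classes=4)  # each row has exactly one 1
recovered = one_hot.argmax(dim=1)           # back to integer labels, losslessly
```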

BCEWithLogitsLoss (and BCELoss) also do not take one-hot
encoded targets. This is because they are used for somewhat
different classification problems, and one-hot targets (or, equivalently,
integer class labels) don’t contain the correct information for these
problems.


And, as outlined above, you would never want to use one-hot-encoded
targets with BCEWithLogitsLoss.


But you shouldn’t use sigmoid() with BCEWithLogitsLoss, as
BCEWithLogitsLoss has, in effect, the sigmoid() built in. Just
pass the raw-score logits that are the output of your final Linear
layer as the input to BCEWithLogitsLoss. (You can use an explicit
sigmoid() to feed BCELoss, but, although mathematically equivalent
to feeding logits to BCEWithLogitsLoss, using BCELoss is
numerically less stable.)
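The mathematical equivalence (but not the numerical stability) is easy to check with hypothetical random data:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 1, 8, 8)
target = torch.randint(0, 2, (4, 1, 8, 8)).float()

loss_stable = nn.BCEWithLogitsLoss()(logits, target)         # preferred
loss_unstable = nn.BCELoss()(torch.sigmoid(logits), target)  # mathematically the same
```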

No. As outlined above, CrossEntropyLoss takes integer class labels
rather than one-hot-encoded labels.

Again, no. CrossEntropyLoss has, in effect, softmax() built in,
so you should feed the output of your final Linear layer directly to
CrossEntropyLoss as its input (without passing it through a
softmax()).


K. Frank


Hi @KFrank

Firstly, I would like to appreciate your explanations and contributions. However, I would be glad for the affirmation of gained knowledge from the thread and likewise some clarifications.

  • The difference between single-label and multi-label multi-class segmentation is the presence of overlapping labels – a case where a given pixel can be annotated as belonging to more than one target class.
  • The go-to loss function for single-label multi-class is CrossEntropyLoss, while for multi-label multi-class it is BCEWithLogitsLoss. In either task, the logits (the final layer of the network’s output) are fed to the loss function together with the stacked_target, both of dimension [nBatch, nClass, width, height]. A combination with Dice loss can also be experimented with.

A typical instance of this:

torch.tensor([[[1., 0., 0.],
         [1., 0., 0.],
         [0., 0., 0.]],

        [[0., 1., 1.],
         [0., 1., 0.],
         [1., 1., 1.]],

        [[0., 0., 1.],
         [0., 1., 1.],
         [1., 0., 1.]]])

Is this what you referred to as multi-hot encoded?

In addressing class imbalance, my approach is to take the sum of the 1’s in each class as the class weight to be passed into my loss function, i.e. [2, 6, 5] in the case of stacked_one_hot_encoded_targets above.
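The per-class counts quoted above can be computed directly from the stacked targets (a sketch using the example tensor):

```python
import torch

targets = torch.tensor([[[1., 0., 0.],
                         [1., 0., 0.],
                         [0., 0., 0.]],

                        [[0., 1., 1.],
                         [0., 1., 0.],
                         [1., 1., 1.]],

                        [[0., 0., 1.],
                         [0., 1., 1.],
                         [1., 0., 1.]]])

# Count of foreground (1) pixels per class channel
pos_counts = targets.sum(dim=(1, 2))
```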

@KFrank Your input will be appreciated