Multi-label classification: returning more than one class for each label

How do I train a multi-label classification model when each label should return more than one class?
My image classification task has 2 labels: style with 4 classes and layout with 5 classes.
An image in the list should return 2 styles and 3 layouts, like [1 0 1 0] [1 1 0 0 1].

Hi Tùng!

The simplest approach would be to recognize that you have a
multi-label, nine-class problem. The fact that the nine classes can
be – conceptually – grouped together into a four-class style group
and a five-class layout group is not something that you need to build
into your network architecture.

So a straightforward approach would be to have the final layer of
your model be a Linear with out_features = 9. You would typically
use BCEWithLogitsLoss. The output of your model would have
shape [nBatch, nClass = 9] and your label (the target you pass
to BCEWithLogitsLoss) would have the same shape. Thus, for
nBatch = 1 your example label would be:

torch.tensor ([[1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]])

(Note the leading nBatch = 1 (singleton) dimension.)
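
A minimal sketch of the setup described above. The 16-feature input and the random tensor standing in for the rest of the model are assumptions made only for illustration; the output size of 9 and the label come from the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_batch, n_features, n_class = 1, 16, 9   # n_features is an arbitrary stand-in

final_layer = nn.Linear(n_features, n_class)   # last layer of the model
loss_fn = nn.BCEWithLogitsLoss()               # expects raw logits, not probabilities

features = torch.randn(n_batch, n_features)    # stand-in for the earlier layers' output
logits = final_layer(features)                 # shape [n_batch, 9]

# 4 style entries followed by 5 layout entries, stored as floats
target = torch.tensor([[1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]])

loss = loss_fn(logits, target)                 # scalar, averaged over all 9 entries
loss.backward()
```

The target has to be a float tensor with the same shape as the logits, since each of the nine entries is treated as its own binary problem.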


K. Frank

Thank you. But when an image returns a Dog and a Cat at the same time, what does the model learn from this?

Hi Tùng!!

In a multi-label, multi-class classification task you expect samples
to have more than one label be active at the same time.

Following your example, let’s say that you have three classes:
Dog, Cat, and Horse. And let’s say that your samples are images
that can contain any (including none) of those three animals at the
same time. If you have an image where a dog appears on the left
and a cat appears on the right, your ground-truth multi-label label
would be “Yes-Dog, Yes-Cat, No-Horse” (e.g., [1, 1, 0]), and
you would want to train your classifier to predict both a Dog and
a Cat at the same time (and no Horse) for that image.
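
To make the Dog / Cat / Horse example concrete, here is a small sketch. The logit values are made up, and the 0.5 threshold is a common default rather than something stated in this thread.

```python
import torch
import torch.nn as nn

class_names = ["Dog", "Cat", "Horse"]

# Ground truth for the image with a dog on the left and a cat on the right.
target = torch.tensor([[1.0, 1.0, 0.0]])

# Hypothetical logits from a trained classifier for that image.
logits = torch.tensor([[2.3, 1.1, -3.0]])

# Each class contributes its own independent binary loss term.
loss = nn.BCEWithLogitsLoss()(logits, target)

# At inference time, convert logits to per-class probabilities and threshold.
probs = torch.sigmoid(logits)
predicted = probs > 0.5   # assumed threshold
labels = [name for name, hit in zip(class_names, predicted[0].tolist()) if hit]
```

With these logits, `labels` contains both "Dog" and "Cat": the classes are not competing with one another the way they would under a softmax.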


K. Frank

This is my model and loss function for 2 labels, style and layout, each with more than 10 classes. Is it correct? How can I improve it?

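import torch.nn as nn
from torchvision import models

class MyModel(nn.Module):
    def __init__(self, n__classes=32):
        super().__init__()
        self.base_model = models.resnet50(pretrained=True).to(device)
        last_channel = self.base_model.fc.in_features
        self.base_model.fc = nn.Sequential()  # remove resnet50's own classifier
        self.layout = nn.Sequential(
            nn.Dropout(),  # dropout rate is not shown in the post; default used
            nn.Linear(last_channel, n_classes_layout),
            nn.Sigmoid(),
        )
        self.style = nn.Sequential(
            nn.Dropout(),
            nn.Linear(last_channel, n_classes_style),
            nn.Sigmoid(),
        )
    def forward(self, x):
        base = self.base_model(x)
        return self.layout(base), self.style(base)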

def loss_fn(outputs, targets):
    o1, o2 = outputs
    t1, t2 = targets
    l1 = nn.BCELoss()(o1, t1)
    l2 = nn.BCELoss()(o2, t2)
    return (l1 + l2) / 2

Hi Tùng!

First comment: For reasons of numerical stability, you should use
BCEWithLogitsLoss and get rid of the final Sigmoid layers.

Other than that, your approach looks reasonable.

However, you don’t need to split your final layer into separate layout
and style pieces. The following is (essentially*) equivalent:

        self.layout_and_style_together = nn.Sequential(
            nn.Dropout(),  # dropout rate is not shown in the post; default used
            nn.Linear(last_channel, n_classes_layout + n_classes_style),
            nn.Sigmoid(),
        )

*) There is a difference in that your version has two separate
Dropouts. Therefore the choice of which inputs to randomly zero
out is made separately for your layout and style layers, where
the same random choice is used for both the “layout” and “style”
sections of my layout_and_style_together layer. (I can’t imagine
that this matters at all.)

Also, if you are willing to forgo the Dropout (and do the right thing by
removing the Sigmoid), you can simply replace resnet50's final layer
with a single Linear:

        self.base_model = models.resnet50 (pretrained=True).to (device)
        last_channel = self.base_model.fc.in_features
        self.base_model.fc = nn.Linear (last_channel, n_classes_layout + n_classes_style)

(I don’t have an opinion about whether your final Dropout is helpful
or necessary. However, the resnet50 architecture doesn’t have one.)


K. Frank

Thank you so much. I have some deeper questions:

  • When should I split my output, and when should I combine the outputs as you did?
  • I think my model will be better if I know the relations between the output labels. Is that true? And how can I study those relations?
  • And why shouldn’t I use the Sigmoid layer?

Sorry for my bad English.

Hi Tùng!

It really doesn’t matter – it is a purely stylistic choice. The two versions
are completely* equivalent. In my version there is a single Linear
whose weight has shape:

[n_classes_layout + n_classes_style, last_channel],

while your version has two Linears with weights with shapes:

[n_classes_layout, last_channel],

and

[n_classes_style, last_channel].

It’s the exact same matrix multiplying the exact same input tensor;
in your version the matrix and computation are just broken up into
two pieces.
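
The equivalence of the Linear layers can be checked numerically. This is a sketch with small, arbitrary sizes (the values 8, 5, and 4 are illustrative only): the weights of two separate Linears are stacked into one combined Linear and the outputs are compared.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

last_channel, n_classes_layout, n_classes_style = 8, 5, 4   # illustrative sizes

layout = nn.Linear(last_channel, n_classes_layout)
style = nn.Linear(last_channel, n_classes_style)

# Build the combined layer by stacking the two weight matrices and biases.
together = nn.Linear(last_channel, n_classes_layout + n_classes_style)
with torch.no_grad():
    together.weight.copy_(torch.cat([layout.weight, style.weight], dim=0))
    together.bias.copy_(torch.cat([layout.bias, style.bias], dim=0))

x = torch.randn(3, last_channel)
split_out = torch.cat([layout(x), style(x)], dim=1)
combined_out = together(x)

same = torch.allclose(split_out, combined_out, atol=1e-5)

# The combined output can be sliced back into its two groups when needed.
layout_logits = combined_out[:, :n_classes_layout]
style_logits = combined_out[:, n_classes_layout:]
```

Slicing the combined output this way recovers the per-group results, so downstream code that wants separate layout and style predictions does not need two heads.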

*) The two versions do actually differ in that your version has two
sets of random Dropout choices, but I think that this difference is
unimportant.

When your labels have structure, it is sometimes better to build
that structure into the architecture of your model so that your model
doesn’t have to “learn” the structure. But it is also sometimes better
not to build the structure into your model, and, instead, rely on your
model to “learn” the structure. It depends on your use case.

By way of example, suppose you have a multi-label, ten-class
problem, and you happen to know that you will never have more
than eight labels active at the same time. That is, your multi-label
label could have any number from zero to eight of the labels active
at the same time, but never nine or ten. You would be better off
not trying to build that structure into your model – just train your
model well, and it won’t (very often) predict more than eight labels
being active at the same time.

On the other hand, suppose you have four classes, A, B, C, and D,
and you have a multi-label problem where you know that exactly
two of the labels will be active at the same time. You may well be
better off recasting your problem as a single-label, six-class problem
where the six classes are AB, AC, AD, BC, BD, and CD. I could
well imagine this version would train more easily because it doesn’t
have to “learn” that multi-labels such as B or ACD or no label at
all never occur.
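
A sketch of that recasting, assuming CrossEntropyLoss as the single-label loss (the thread does not name one) and using a hypothetical helper, `multi_hot_to_class`, to convert a multi-hot label into a pair index.

```python
from itertools import combinations

import torch
import torch.nn as nn

labels = ["A", "B", "C", "D"]

# Enumerate every pair of labels: AB, AC, AD, BC, BD, CD.
pairs = list(combinations(range(len(labels)), 2))
pair_names = ["".join(labels[i] for i in pair) for pair in pairs]
pair_to_class = {pair: idx for idx, pair in enumerate(pairs)}

def multi_hot_to_class(multi_hot):
    """Map a multi-hot vector with exactly two active labels to its pair index."""
    active = tuple(i for i, v in enumerate(multi_hot) if v == 1)
    return pair_to_class[active]

# [0, 1, 0, 1] has B and D active, which is the pair class "BD".
target = torch.tensor([multi_hot_to_class([0, 1, 0, 1])])

# Hypothetical logits over the six pair classes for one sample.
logits = torch.randn(1, len(pairs))
loss = nn.CrossEntropyLoss()(logits, target)
```

A multi-hot vector with any other number of active labels raises a KeyError here, which reflects the point of the recasting: such label combinations are not representable at all.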

Sigmoid followed by BCELoss is mathematically equivalent to
BCEWithLogitsLoss, but numerically it is less stable. Internally
BCEWithLogitsLoss combines the Sigmoid with a log() function
into a log_sigmoid() implementation that uses the log-sum-exp trick
for improved numerical stability.
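
A small sketch of the difference, using an extreme logit chosen for illustration. In float32, sigmoid(-200) underflows to zero, so the Sigmoid plus BCELoss path can no longer tell how wrong the prediction is (PyTorch's BCELoss clamps its log outputs at -100), while BCEWithLogitsLoss works from the logit itself.

```python
import torch
import torch.nn as nn

# An extreme logit: the model is confidently wrong about a positive label.
logit = torch.tensor([-200.0], requires_grad=True)
target = torch.tensor([1.0])

# Sigmoid followed by BCELoss: the probability has already underflowed to 0.
naive = nn.BCELoss()(torch.sigmoid(logit), target)

# BCEWithLogitsLoss computes the same quantity directly from the logit.
stable = nn.BCEWithLogitsLoss()(logit, target)
stable.backward()
stable_grad = logit.grad.clone()   # sigmoid(logit) - target, close to -1 here
```

Here `stable` is about 200, the mathematically expected value, while `naive` is capped at a smaller value, and the gradient on the stable path remains usable for training.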


K. Frank

Thank you. Are there any papers about constraints between labels in multi-label image classification?