Multi-label classification without multi-label training data

In the past I have managed to do multi-label classification with MultiLabelSoftMarginLoss, which worked well, but that was with training data that had a complete set of labels (i.e. for a single image there were no missing labels).

I am now wondering what loss function to use for the case where each training image has only one label. So for example, I have the labels blue, red, green, square-shaped, curve-shaped, car, bike etc (there are about 200 labels). For the label blue I have a whole set of training images with blue items. For the label car I have another set of images with cars.

This problem is kind of an extreme case of “multi-label with missing labels” (MLML). In MLML papers the images often still have multiple labels attached. In my case there is really only 1 label per image.

I have tried simply using BCEWithLogitsLoss: with ImageFolder I convert each sample to a one-hot label vector and train with that. It works OK, but the accuracy is not very high. I think this is because when I train with an image of a blue bike, I tell the network the image is blue (correct) but NOT a bike (wrong!).
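To make the issue concrete, here is a minimal sketch of this setup, assuming 200 labels and a hypothetical index 0 for "blue". The zeros in the one-hot target are exactly the problem: every label the image doesn't carry is trained as a hard negative, including "bike".

```python
import torch
import torch.nn as nn

num_labels = 200                      # assumed label count from the example

# Model outputs for a batch of 4 images, all labelled only "blue"
logits = torch.randn(4, num_labels)
targets = torch.zeros(4, num_labels)  # one-hot: all other labels forced to 0
targets[:, 0] = 1.0                   # index 0 = "blue" (hypothetical)

# BCEWithLogitsLoss treats every 0 as a confident negative,
# so a blue bike is actively trained NOT to be a bike.
loss = nn.BCEWithLogitsLoss()(logits, targets)
```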

Which loss function in pytorch is suitable for this problem or should I try another approach?



Thank you for sharing your problem!
Naïvely, I would expect it may work better to train the model only to discriminate between mutually exclusive options, if the labels allow that. From your example, you would have the groups [blue, red, green], [square-shaped, curve-shaped], and [car, bike].
In this view, the cross-entropy loss would penalise red and green and favour blue for images labelled blue, but leave the probabilities for unrelated labels alone. You didn't say, but for this I would expect each set of exclusive categories (and only those) to share a softmax.
It’ll be very interesting to hear what works and what does not.

Best regards


Hi Thomas, thanks for your quick reply.

It’s a great idea to group together exclusive options; I have actually done some testing with that, and it’s not that difficult to implement. Essentially, the network becomes a multi-task model, with each exclusive label group being a task. Each image trains only one task at a time, while ignoring the other outputs.
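For a batch, this per-image masking might look like the sketch below (group layout and indices are assumed): each sample contributes a cross-entropy term only for the group its label belongs to.

```python
import torch
import torch.nn.functional as F

# Assumed layout: colour, shape, vehicle groups sharing one output head
groups = [slice(0, 3), slice(3, 5), slice(5, 7)]

def one_group_loss(logits, group_ids, targets):
    """Each sample trains only its own exclusive group; other heads get no gradient."""
    total = logits.new_zeros(())
    for g, sl in enumerate(groups):
        mask = group_ids == g
        if mask.any():
            total = total + F.cross_entropy(logits[mask][:, sl], targets[mask])
    return total

logits = torch.randn(4, 7)
group_ids = torch.tensor([0, 0, 2, 1])  # which group each image's label is in
targets = torch.tensor([0, 2, 1, 0])    # label index *within* that group
loss = one_group_loss(logits, group_ids, targets)
```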

A new problem shows up with this approach however:

Some images simply do not belong to any of the labels within an exclusive group. A color, admittedly, can always be predicted, since every object has to belong to one color; that is a good thing, because the network is trained to always predict one color. However, say I have an image of a bike and a group of exclusive labels [4-doors, 2-doors, 1-door], indicating how many car doors there are in the picture. The image of a bike belongs to NONE of these labels, but the network will probably guess the most common one anyway, since the outputs of that exclusive group have been trained to always predict one.

Do you see the problem I am trying to describe here? Perhaps there is a way to structure the exclusive label groups so that no group can end up with zero applicable labels. Or do you have a suggestion for how to deal with this problem?


There are a few things you could do here. You basically have a semi-supervised learning problem, since for any group you have some data that has a label from that group and some that doesn’t.

As an initial test I would recommend creating a dataset for which each image always has a label from each category. For instance you could take ImageNet and train a network to jointly predict both the object category and the dominant colour in the image. This removes the semi-supervised component and ensures your network can generalise to multiple tasks.

Alternatively, every group could have an additional ‘unknown’ label. I’ve never used this approach in practice, so while it’s simple to add on, I don’t know how it would perform.
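The ‘unknown’ option is just one extra class per group. A sketch for the doors example, with an assumed extra "no-car" class; note this only trains meaningfully if images labelled in other groups are actually fed in with the ‘unknown’ target for this group:

```python
import torch
import torch.nn.functional as F

# Assumed layout: [4-doors, 2-doors, 1-door, no-car] -> 4 logits instead of 3
door_logits = torch.randn(1, 4)
NO_CAR = 3                            # index of the added "unknown" class

# A bike image now has a valid target in every group:
loss = F.cross_entropy(door_logits, torch.tensor([NO_CAR]))
```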

There has been lots of work on semi-supervised learning, for instance using good old-fashioned adversarial nets.

You should be careful to distinguish between the cases where you have:

  1. The labels [blue, red] and [car, bike] and you show it a <blue, truck> instance --> what is appropriate here?
  2. The labels [blue, red] and [car, bike, truck], but your particular instance is a blue truck that is only labelled blue. This is the semi-supervised case, as the network should induce, from other labelled trucks, that while this is blue it is also a truck.

In case (1) you can really only structure it so that you either have a guaranteed label from every group, or you let the network choose a possible ‘unknown’ option for every group.

Sorry if I’ve only repeated your question and not added any value. This is an interesting problem!

  • Jordan