Triplet vs Cross entropy loss for multi-label classification

Hi, this is a general question about multi-label classification I have been thinking about:

Multi-label classification for < 200 labels can be done in many ways, but here I consider two options:

  1. CNN (e.g. ResNet, VGG) + cross entropy loss: the traditional approach, where the final layer contains the same number of nodes as there are labels. Samples are drawn randomly and the predictions are compared to the known labels.

  2. CNN (e.g. ResNet, VGG) + triplet loss, where the final layer is a “feature vector”. In the end the classes can be found with KNN or a separate single-layer network that maps the feature vector to a class. Triplets are constructed randomly, with the anchor and positive from the same label and the negative from another label.
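To make the random triplet construction in option 2 concrete, here is a minimal plain-Python sketch. The helper names (`sample_triplet`, `triplet_loss`), the toy labels and the 2-D "features" are all made up for illustration, and the loss is the common margin form max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance, which is an assumption about which triplet loss is meant.

```python
import math
import random

def sample_triplet(labels, rng):
    """Pick (anchor, positive, negative) sample indices at random.

    labels: list of class ids, one per sample (hypothetical toy data).
    Anchor and positive share a class; the negative comes from another class.
    """
    by_class = {}
    for idx, c in enumerate(labels):
        by_class.setdefault(c, []).append(idx)
    # Only classes with at least two samples can supply an anchor/positive pair.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= 2]
    pos_class = rng.choice(eligible)
    anchor, positive = rng.sample(by_class[pos_class], 2)
    neg_class = rng.choice([c for c in by_class if c != pos_class])
    negative = rng.choice(by_class[neg_class])
    return anchor, positive, negative

def triplet_loss(a, p, n, margin=1.0):
    """Margin triplet loss on feature vectors, using Euclidean distance."""
    return max(math.dist(a, p) - math.dist(a, n) + margin, 0.0)

# Toy data: five samples, three classes, 2-D stand-ins for CNN feature vectors.
labels = [0, 0, 1, 1, 2]
features = [[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [3.0, 0.2], [0.0, 4.0]]
rng = random.Random(0)
a, p, n = sample_triplet(labels, rng)
loss = triplet_loss(features[a], features[p], features[n])
```

In a real training loop the feature vectors would come from the CNN and the loss would be computed on tensors (PyTorch provides `nn.TripletMarginLoss` for that); the sketch only shows how the triplets are formed and what is being penalized.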

Does the second option (triplet network) make sense if the only goal in the end is to classify images? Can we expect similar performance from the same network trained with plain cross entropy loss? Is it just over-complicated to use triplets in this scenario?

Additionally, does the best choice change if we want to do multiclass classification instead of multi-label?


I’m not sure we’re using terminology the same way:

multi-class: a sample can be one of N classes where N>2 (e.g. ImageNet1k)
multi-label: a sample can be labeled with more than one class (e.g. youtube8m video classification)

  1. For multi-class (one ground truth label), softmax + cross entropy loss works well.
    Use F.cross_entropy / nn.CrossEntropyLoss

  2. For multi-label (multiple ground truth labels), sigmoid + cross entropy per label works well.
    Use F.binary_cross_entropy_with_logits / nn.BCEWithLogitsLoss

  3. I haven’t used triplet loss, but I’ve seen it used in open-ended classification problems like OpenFace. For ~200 labels, I suspect sigmoid + cross entropy will work better.
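As a rough illustration of the difference between cases 1 and 2, the following plain-Python sketch computes both losses for a single sample. The function names are made up for this example; the formulas are the standard ones that `F.cross_entropy` and `F.binary_cross_entropy_with_logits` evaluate for one sample with the default mean reduction.

```python
import math

def softmax_cross_entropy(logits, target):
    """Multi-class: exactly one ground-truth class index per sample."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]  # equals -log(softmax(logits)[target])

def sigmoid_bce(logits, targets):
    """Multi-label: an independent 0/1 target per label, averaged over labels."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

logits = [2.0, -1.0, 0.5]                               # raw scores for 3 labels
multi_class_loss = softmax_cross_entropy(logits, 0)     # target is a class index
multi_label_loss = sigmoid_bce(logits, [1, 0, 1])       # target is a 0/1 vector
```

The practical difference is in the target format: a single class index for the softmax case, and a multi-hot vector (as floats in PyTorch) for the sigmoid case.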

I’ve also seen the multi-label case treated as multi-class. Each time you sample, you treat one “ground truth” label as the positive and everything else as negative. You can then use softmax instead of independent classifiers. This paper does that for learning features from flickr100m hash tags. My guess is that if your metric is something like average precision (instead of good features for transfer learning), then (2) will work better.
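A minimal sketch of that sampling trick, assuming each sample carries a set of ground-truth label indices (the helper name and the example label set are hypothetical): on every draw, one label is picked at random and used as the single softmax target.

```python
import random

def sample_single_label(label_set, rng):
    """Turn a multi-label sample into a multi-class target by picking one label."""
    # sorted() gives a deterministic order so a seeded rng is reproducible.
    return rng.choice(sorted(label_set))

rng = random.Random(0)
image_labels = {3, 17, 42}  # hypothetical ground-truth label indices for one image
# Across epochs the same image is paired with different targets from its label set.
targets = [sample_single_label(image_labels, rng) for _ in range(5)]
```

Each sampled target can then be fed to an ordinary softmax + cross entropy loss, as in the multi-class case.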


Thanks for the quick reply @colesbury!

I swapped multi-class and multi-label, sorry for the confusion. I did indeed mean multi-class as the main case to consider, with multi-label as an extension of the problem.

If I consider ImageNet1k (multi-class problem), I can use the following 2 models to achieve a good classifier:

A) model based on triplet loss + KNN or 1-layer network to determine class
B) model with softmax + cross entropy

Additionally, I could even take the before-last layer for model B to get a feature vector instead of a class to do some ranking (images with similar features ending up in the same cluster).

So now I can conclude these two models can do the same two things: classifying and ranking. I wonder how they compare:

  • If it is not an open-ended problem, is it still a good idea to use triplets (model A) for learning features instead of the simpler model B?
  • Can I expect better ranking from the feature vector of the triplet model A than from the vector taken from the before-last (or so) layer of model B?

For classification accuracy, I strongly suspect model B would work better.

I’m less familiar with ranking, but I’ve gotten qualitatively good results by training with softmax + cross entropy and doing KNN on the features (the before-last layer). If you have time, you might want to try both. The linked paper looks like it uses triplet loss for ranking images.
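For reference, KNN on feature vectors can be sketched in plain Python as below. The function names and the toy 2-D features are illustrative only; in practice the features would be the before-last layer activations of the trained network, and Euclidean distance is an assumption (cosine similarity is another common choice).

```python
import math
from collections import Counter

def rank_by_similarity(query, features):
    """Return sample indices ordered from nearest to farthest (Euclidean)."""
    return sorted(range(len(features)), key=lambda i: math.dist(query, features[i]))

def knn_predict(query, features, labels, k=3):
    """Classify a query by majority vote among its k nearest feature vectors."""
    nearest = rank_by_similarity(query, features)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy stand-ins for feature vectors extracted from a trained network.
features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels = ["cat", "cat", "dog", "dog", "dog"]
predicted = knn_predict([0.2, 0.1], features, labels, k=3)
ranking = rank_by_similarity([0.2, 0.1], features)
```

The same `rank_by_similarity` call covers the ranking use case, and `knn_predict` covers classification, so both models in the comparison can be evaluated with the same code once their feature vectors are extracted.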

OK, yes, I am familiar with that paper, but that model has a catch: the triplets are not just randomly sampled from classes. Instead, human raters are involved to make sure the anchor and positive are truly similar. That is a lot of extra work compared to simply training a softmax + cross entropy model and running KNN.

For example, in their Figure 1 there is a triplet of 3 lamps, let’s say all three are from the same “lamp” class. Human raters say the positive and anchor are more similar than the negative. This is not what you get if you simply construct triplets at random, taking the anchor and positive from the same class and the negative from another class entirely.

With human raters forming the triplets, I suspect the triplet ranking would be better, but with simple random triplets, the ranking might perform similarly to what you get from the before-last layer of a softmax + cross entropy model.

I guess I just have to try both and see, thanks for helping out!

Dear @colesbury & @bartolsthoorn,
I have an extremely large-scale multi-label data set (about 12M images and 11K labels). Could you kindly advise on the best way to represent each sample with its corresponding labels, with the best multi-GPU utilization and data loading efficiency?
Thank you