Multi-Label Classification in PyTorch

Hi Soumith,

I see that a bunch of people feel multi label classification is important and don’t have the details figured out. I can build an example based off of the code I wrote for my research. Is the standard way to fork the git repo and request merge?



Hi Spandan,

that would be a great thing to help the community :wink: Good working examples are always warmly appreciated.



Dear @mratsim
I have an extremely large-scale multi-label data set (about 12M images and 11K labels). Could you please advise me on the best way to represent each sample with its corresponding labels, with the best multi-GPU utilization and data loading efficiency?
Thank you


Hey @ahkarami, I’m sorry, I have never processed data at such a scale (yet :wink: ), and without playing with the data and your IT architecture I would have trouble helping you there.

Here is how I would go about it:

  • Get as much RAM as you can, get SSDs as well.
  • Load the data on the fly with multiple workers so that the CPU can feed your data as fast as the GPUs process it.
  • Have a look at PyTorch Distributed.
  • If data storage or storage of NumPy arrays is an issue after preprocessing, look into bcolz for in-memory or on-disk compressed NumPy-compatible arrays. I wrote an article on that here, but I only had 160 GB of images to process.

For the multi-GPU side, you will probably have to summon one of the PyTorch core devs.


One way to do this is to not load everything into the dataloader, and just write one which assembles the labels on the fly. Say your GPU would handle something like 150 images in one go. So it needs 150 vectors of length 11K in one go, as each image’s label can be binarized [1,0,0,0,1…] (1 if the image has that label and 0 if it doesn’t.)

First, build a dictionary mapping each image name to its label vector and save it with Python's pickle module. Let’s call this pickle file ‘image_name_to_label_vector.pckl’.
Now you can create a new data loader like this. All I’ve changed from the original data loader is the __getitem__ method, where I’m looking up the labels on the fly from this dictionary. Simple!
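
As a rough sketch of the dictionary step described above (not the poster's actual preprocessing script): the helper name `build_label_vectors`, the toy `annotations` mapping and the 5-label vocabulary are hypothetical, and the label indices of each image are assumed to be known already.

```python
import os
import pickle
import tempfile

def build_label_vectors(annotations, num_labels):
    """Turn {image name: [label indices]} into {image name: multi-hot list}."""
    vectors = {}
    for name, label_ids in annotations.items():
        vec = [0] * num_labels
        for idx in label_ids:
            vec[idx] = 1  # 1 if the image has that label, 0 otherwise
        vectors[name] = vec
    return vectors

# Hypothetical toy annotations: image file name -> indices of its labels.
annotations = {"cat_001.jpg": [0, 4], "dog_017.jpg": [2]}
image_name_to_label_vector = build_label_vectors(annotations, num_labels=5)

# Store the dictionary with pickle, then read it back as the data loader would.
pickle_path = os.path.join(tempfile.mkdtemp(), "image_name_to_label_vector.pckl")
with open(pickle_path, "wb") as f:
    pickle.dump(image_name_to_label_vector, f)
with open(pickle_path, "rb") as f:
    restored = pickle.load(f)

print(restored["cat_001.jpg"])  # [1, 0, 0, 0, 1]
```

With millions of images and thousands of labels, dense vectors like these get large, so storing only the label indices and expanding them inside `__getitem__` may be the better trade-off.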

IN YOUR PYTORCH FILE, add the new data loader -

from torchvision import datasets, models, transforms
import matplotlib.pyplot as plt
from ImageFolder_new import ImageFolder_spandan

DATA LOADER (save as the ImageFolder_new module imported above) -

import torch.utils.data as data
import pickle
import numpy as np
from PIL import Image
import os
import os.path
import torch

IMG_EXTENSIONS = [
    '.jpg', '.JPG', '.jpeg', '.JPEG',
    '.png', '.PNG', '.ppm', '.PPM', '.bmp', '.BMP',
]

# Load the image-name -> label-vector dictionary once, at import time.
with open('image_name_to_label_vector.pckl', 'rb') as f:
    image_name_to_label_vector = pickle.load(f)

def is_image_file(filename):
    return any(filename.endswith(extension) for extension in IMG_EXTENSIONS)

def find_classes(dir):
    classes = [d for d in os.listdir(dir) if os.path.isdir(os.path.join(dir, d))]
    class_to_idx = {classes[i]: i for i in range(len(classes))}
    return classes, class_to_idx

def make_dataset(dir, class_to_idx):
    # Collect (image path, class index) pairs, one subfolder per class.
    images = []
    concept_or_tag_features = []
    dir = os.path.expanduser(dir)
    for target in sorted(os.listdir(dir)):
        d = os.path.join(dir, target)
        if not os.path.isdir(d):
            continue

        for root, _, fnames in sorted(os.walk(d)):
            for fname in sorted(fnames):
                if is_image_file(fname):
                    path = os.path.join(root, fname)
                    item = (path, class_to_idx[target])
                    images.append(item)

    return images

def pil_loader(path):
    # open path as file to avoid ResourceWarning
    with open(path, 'rb') as f:
        with Image.open(f) as img:
            image_converted = img.convert('RGB')
            return image_converted

def accimage_loader(path):
    import accimage
    try:
        return accimage.Image(path)
    except IOError:
        # Potentially a decoding problem, fall back to PIL.Image
        return pil_loader(path)

def default_loader(path):
    from torchvision import get_image_backend
    if get_image_backend() == 'accimage':
        return accimage_loader(path)
    else:
        return pil_loader(path)

class ImageFolder_spandan(data.Dataset):
    """A generic data loader where the images are arranged in one subfolder
    per class under ``root``.

    Args:
        root (string): Root directory path.
        transform (callable, optional): A function/transform that takes in a PIL image
            and returns a transformed version. E.g, ``transforms.RandomCrop``
        target_transform (callable, optional): A function/transform that takes in the
            target and transforms it.
        loader (callable, optional): A function to load an image given its path.

    Attributes:
        classes (list): List of the class names.
        class_to_idx (dict): Dict with items (class_name, class_index).
        imgs (list): List of (image path, class_index) tuples
    """

    def __init__(self, root, transform=None, target_transform=None,
                 loader=default_loader):
        classes, class_to_idx = find_classes(root)
        imgs = make_dataset(root, class_to_idx)
        if len(imgs) == 0:
            raise(RuntimeError("Found 0 images in subfolders of: " + root + "\n"
                               "Supported image extensions are: " + ",".join(IMG_EXTENSIONS)))

        self.root = root
        self.imgs = imgs
        self.classes = classes
        self.class_to_idx = class_to_idx
        self.transform = transform
        self.target_transform = target_transform
        self.loader = loader

    def __getitem__(self, index):
        """
        Args:
            index (int): Index

        Returns:
            tuple: (image, label) where label is the multi-label vector looked
            up by file name; the folder-based class index is not returned.
        """
        path, target = self.imgs[index]
        img = self.loader(path)
        if self.transform is not None:
            img = self.transform(img)
        if self.target_transform is not None:
            target = self.target_transform(target)
        # Look up the label vector on the fly, keyed by the image file name.
        name = os.path.basename(path)
        label = image_name_to_label_vector[name]
        return img, label

    def __len__(self):
        return len(self.imgs)

That handles your data loading without anything too fancy. If you have the resources to parallelise this, feel free to use DataParallel!
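
To show how such a dataset is typically consumed, here is a minimal sketch. `ToyMultiLabelDataset` is a hypothetical in-memory stand-in for the folder-based loader above (random tensors instead of images), and the tiny linear model is only a placeholder for a CNN. The `DataParallel` branch only runs when more than one GPU is visible.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

class ToyMultiLabelDataset(Dataset):
    """Stand-in dataset yielding (image tensor, multi-hot label vector) pairs."""

    def __init__(self, num_samples=8, num_labels=5):
        generator = torch.Generator().manual_seed(0)
        self.images = torch.rand(num_samples, 3, 32, 32, generator=generator)
        self.labels = torch.randint(0, 2, (num_samples, num_labels),
                                    generator=generator).float()

    def __getitem__(self, index):
        return self.images[index], self.labels[index]

    def __len__(self):
        return len(self.images)

dataset = ToyMultiLabelDataset()
# num_workers=0 keeps the sketch portable; raise it so the CPU can keep the GPUs fed.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 5))
if torch.cuda.device_count() > 1:
    # Replicate the model and split each batch across the visible GPUs.
    model = nn.DataParallel(model).cuda()

device = next(model.parameters()).device
for images, labels in loader:
    logits = model(images.to(device))
    print(logits.shape, labels.shape)  # both torch.Size([4, 5])
```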


Dear @mratsim & @SpandanMadan,
I really appreciate the useful hints and guidance. I will use the methods presented.
Thank you very much

Dear @mratsim & @SpandanMadan,
I have another question. One of the well-known multi-label classification methods is the sigmoid cross-entropy loss (we can add an F.sigmoid() layer at the end of our CNN model and then use, for example, nn.BCELoss()). My question is: is it better to plug the F.sigmoid() layer into the end of the CNN model during training, or to leave F.sigmoid() out during training and apply it to the network's output only after the parameters have been learned? (i.e., should the sigmoid layer be part of the structure of our net or not?)

Let’s take ResNet finetuning as an example:

import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class ResNet50(nn.Module):
    def __init__(self, num_classes):
        super(ResNet50, self).__init__()
        # Loading ResNet arch from PyTorch
        original_model = models.resnet50(pretrained=True)
        # Everything except the last linear layer
        self.features = nn.Sequential(*list(original_model.children())[:-1])
        # Get number of features of last layer
        num_feats = original_model.fc.in_features
        # Plug our classifier
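        self.classifier = nn.Sequential(
            nn.Linear(num_feats, num_classes)
        )

        # Init of last layer (the initializer call was lost from the post;
        # Kaiming-normal init is an assumption)
        for m in self.classifier:
            nn.init.kaiming_normal_(m.weight)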

        # Freeze all weights except the last classifier layer
        # for p in self.features.parameters():
        #     p.requires_grad = False

    def forward(self, x):
        f = self.features(x)
        f = f.view(f.size(0), -1)
        y = self.classifier(f)
        return y

Is your question about using sigmoid here?

    def forward(self, x):
        f = self.features(x)
        f = f.view(f.size(0), -1)
        y = self.classifier(f)
        y = F.sigmoid(y) # Is this better ?
        return y

Or one level higher?
Currently there is no difference.

Ideally, in the future you should use MultiLabelSoftMarginLoss during training once it is numerically stable and faster; see PyTorch issue 1516.

Currently, MultiLabelSoftMarginLoss in PyTorch is implemented the naive way, as separate sigmoid and cross-entropy passes; a fused implementation would be faster and more accurate.

The proper way is to use the log-sum-exp trick to simplify the Sigmoid Cross Entropy (SCE) expression from this (after naive substitution of the sigmoid into the cross-entropy function):

SCE(x, y') = -1/n ∑_i ( t_i * (x_i - ln(1 + e^x_i)) + (1 - t_i) * (-ln(1 + e^x_i)) )

with t_i (read target_i) being the elements of y’,

to this:

SCE(x, y') = -1/n ∑_i ( t_i * x_i - max(x_i, 0) - ln(1 + e^(-|x_i|)) )

This is more numerically stable and much faster to compute.
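
The two forms can be compared in a few lines of plain Python. This is only an illustrative sketch of the formulas, not PyTorch's implementation; the function names `sce_naive` and `sce_stable` are made up for the example.

```python
import math

def sce_naive(logits, targets):
    """Sigmoid cross-entropy with the sigmoid substituted in directly."""
    total = 0.0
    for x, t in zip(logits, targets):
        log_term = math.log(1.0 + math.exp(x))  # overflows for large x
        total += t * (x - log_term) + (1.0 - t) * (-log_term)
    return -total / len(logits)

def sce_stable(logits, targets):
    """Same loss, using ln(1 + e^x) = max(x, 0) + ln(1 + e^-|x|)."""
    total = 0.0
    for x, t in zip(logits, targets):
        # -(t*x - max(x, 0) - ln(1 + e^-|x|)), written with the sign folded in
        total += max(x, 0.0) - t * x + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 1.0]
print(sce_naive(logits, targets), sce_stable(logits, targets))  # equal up to rounding

# With a large logit the naive form overflows, the stable form does not.
try:
    sce_naive([1000.0], [1.0])
except OverflowError as err:
    print("naive form failed:", err)
print(sce_stable([1000.0], [1.0]))  # 0.0
```

The stable form only ever exponentiates a non-positive number, so `math.exp` can underflow to 0 but never overflow.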

A full explanation of each simplification step is in my own PyTorch-like framework here

Note: ln(1 + x) is also numerically unstable if x << 1 (much smaller than 1): 1 + x will be rounded to 1 and ln(1) gives a result of 0 (loss of precision), even though ln(1 + x) ~= x when x is small, which means the network will wrongly stop training because there is no gradient. NumPy has the log1p function to avoid that, and PyTorch exposes the same operation as torch.log1p.
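
The effect is easy to reproduce with the standard library alone; the value 1e-20 is just an arbitrary small number chosen for the demonstration.

```python
import math

x = 1e-20
print(math.log(1.0 + x))  # 0.0: 1 + x was rounded to exactly 1
print(math.log1p(x))      # 1e-20, the expected ln(1 + x) ~= x
```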


Dear @mratsim,
Thank you very much for your complete and useful response. I have found that if I use the Sigmoid layer in the network (i.e., your second script in the above message), the BCE loss is the same as the MultiLabelSoftMarginLoss.
Another thing: I was also impressed by your fantastic Arraymancer library. Can we use it in PyTorch? I mean, how can we use the simplified Sigmoid Cross Entropy (SCE) loss in PyTorch?
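
The equivalence can be checked numerically. This is a small sketch on random data (the shapes are arbitrary) that also includes `nn.BCEWithLogitsLoss`, which takes raw logits and applies the sigmoid internally:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 6)                     # raw network outputs, no sigmoid
targets = torch.randint(0, 2, (4, 6)).float()  # multi-hot labels

bce_after_sigmoid = nn.BCELoss()(torch.sigmoid(logits), targets)
soft_margin = nn.MultiLabelSoftMarginLoss()(logits, targets)
bce_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# All three agree up to floating-point rounding with the default mean reduction.
print(bce_after_sigmoid.item(), soft_margin.item(), bce_with_logits.item())
```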


For now there are no Python bindings so we can’t.

Hello! This would be a life-saver for me as well… Could you show us an example of how you pass the multiple labels to the loss function?

Thanks in advance!
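
A minimal sketch of one way to do it, assuming multi-hot float targets and `nn.BCEWithLogitsLoss`; the linear model, the random inputs and the learning rate are placeholders, not taken from anyone's code in this thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_labels = 5
model = nn.Linear(10, num_labels)          # stand-in for a CNN with a linear head
criterion = nn.BCEWithLogitsLoss()         # applies the sigmoid internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(3, 10)
# One row per sample, one column per label; several 1s per row are allowed.
targets = torch.tensor([[1., 0., 0., 0., 1.],
                        [0., 1., 1., 0., 0.],
                        [0., 0., 0., 0., 0.]])

losses = []
for _ in range(20):
    optimizer.zero_grad()
    logits = model(inputs)                 # shape (3, 5), same as targets
    loss = criterion(logits, targets)      # targets must be float
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# At inference time, threshold the sigmoid outputs to get 0/1 predictions.
predictions = (torch.sigmoid(model(inputs)) > 0.5).int()
print(losses[0], losses[-1])
print(predictions)
```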

I find it confusing to use Sigmoid(output) > 0.5 as the labeling criterion.
With the sigmoid function, when output == 0, Sigmoid(output) == 0.5.
And in your example, all the labels are either 0 or 1.
Thus, the negative samples will converge to 0, within a range of [0 - a, 0 + a].
And using Sigmoid(output) > 0.5 will label all negative samples in (0, 0 + a] as positive.

I did a similar project with 12 labels to classify, where each observation could belong to multiple labels or none.
In my case, positive cases are rare: in the labels, there are 49 times as many 0s as 1s.
I used BCELoss with weights, setting the weight for positive cases to 49 times the weight for negative cases, and finally got 0.85 AUC on average.
Just treat the multi-label problem as unbalanced data.
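
A related way to get the same kind of reweighting is the `pos_weight` argument of `nn.BCEWithLogitsLoss`. Note that this is a swapped-in alternative, not the poster's BCELoss-with-weight setup, and it assumes the 49:1 ratio holds for every one of the 12 labels.

```python
import torch
import torch.nn as nn

num_labels = 12
# Assumed ratio from the post: 49 negatives per positive, for every label.
pos_weight = torch.full((num_labels,), 49.0)
weighted = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
unweighted = nn.BCEWithLogitsLoss()

logits = torch.zeros(2, num_labels)          # sigmoid(0) = 0.5 everywhere
targets = torch.zeros(2, num_labels)
targets[0, 3] = 1.0                          # a single positive among 24 entries

print(unweighted(logits, targets).item())    # ln(2) ~= 0.6931
print(weighted(logits, targets).item())      # the positive term counts 49 times
```

With per-label imbalance that differs from label to label, `pos_weight` could instead hold one ratio per label.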


Hello, SpandanMadan! Recently I have had some trouble with multi-label classification tasks. Could you share your example classification repo? Thanks

I chose MultiLabelMarginLoss as the loss function, but during training the output changed oddly.
The first column becomes much larger than 1, while the values in the other columns become much smaller than 1.

My core code is as follows:

criterion = nn.MultiLabelMarginLoss()
optimizer = optim.SGD(mynet.parameters(), lr=5e-3)

Do you know why?
Or could you show me a correct demo?
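
One possible cause, offered as an assumption because the post does not show how the targets are built: `nn.MultiLabelMarginLoss` expects targets as class indices padded with -1, not as a 0/1 vector. The sketch below uses a small four-class example; `multi_hot_to_index_targets` is a hypothetical helper written for this example, not a PyTorch function.

```python
import torch
import torch.nn as nn

criterion = nn.MultiLabelMarginLoss()

scores = torch.tensor([[0.1, 0.2, 0.4, 0.8]])
# The sample belongs to classes 3 and 0. Targets hold class indices, and the
# first -1 ends the list (entries after it are ignored).
targets = torch.tensor([[3, 0, -1, 1]])
loss = criterion(scores, targets)
print(loss.item())  # ~0.85

def multi_hot_to_index_targets(multi_hot):
    """Hypothetical helper: convert 0/1 rows into index rows padded with -1."""
    index_targets = torch.full_like(multi_hot, -1, dtype=torch.long)
    for row, labels in enumerate(multi_hot):
        positives = labels.nonzero(as_tuple=True)[0]
        index_targets[row, :len(positives)] = positives
    return index_targets

converted = multi_hot_to_index_targets(torch.tensor([[1, 0, 0, 1]]))
print(converted)                            # tensor([[ 0,  3, -1, -1]])
print(criterion(scores, converted).item())  # same loss as above
```

If a 0/1 vector is passed directly, its entries are read as class indices, so a leading 1 or 0 would be interpreted as "class 1" or "class 0" rather than as a label mask.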

Hi Luckick,

How did you assign weights to positive (1) and negative (0) cases? In BCELoss() we can assign weights to each label, right? Please correct me if I am wrong.

There is a parameter in BCELoss(), weight (Tensor, optional) – a manual rescaling weight given to the loss of each batch element. If given, has to be a Tensor of size “nbatch”.
So it is not assigned to each label, but you can set it for each input according to the proportion of labels.

So you pass a weight tensor of the same shape as the target, with only two unique values in it? And those two values, let's say w1 and w2, will be at the same positions in the weight tensor as 1 and 0 respectively are in the target tensor?

Yes, you are right…
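
A minimal sketch of that weight tensor, using the functional form `F.binary_cross_entropy` so the weight can be rebuilt for every batch. The values 49 and 1 for w1 and w2 and the constant 0.5 probabilities are assumptions chosen to make the arithmetic easy to follow.

```python
import torch
import torch.nn.functional as F

w_pos, w_neg = 49.0, 1.0                      # w1 and w2 in the post; values assumed
targets = torch.tensor([[1., 0., 0.],
                        [0., 0., 1.]])
probs = torch.full_like(targets, 0.5)         # stand-in for sigmoid outputs

# w_pos wherever the target is 1, w_neg wherever it is 0.
weight = targets * w_pos + (1.0 - targets) * w_neg

loss = F.binary_cross_entropy(probs, targets, weight=weight)
print(weight)
print(loss.item())  # ln(2) * (49 + 49 + 4) / 6
```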

Can anyone share how they calculate accuracy or evaluate performance for multi-label classification?
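
A few commonly used measures, sketched in plain Python so it does not depend on any framework. The function name `multilabel_metrics`, the 0.5 threshold and the toy data are assumptions for illustration; which metric is appropriate depends on the task.

```python
def multilabel_metrics(probabilities, targets, threshold=0.5):
    """Exact-match ratio, Hamming accuracy and micro-averaged precision/recall/F1."""
    tp = fp = fn = correct_entries = exact_matches = 0
    for probs, truth in zip(probabilities, targets):
        predicted = [1 if p > threshold else 0 for p in probs]
        exact_matches += int(predicted == truth)   # every label of the sample is right
        for p, t in zip(predicted, truth):
            correct_entries += int(p == t)
            tp += int(p == 1 and t == 1)
            fp += int(p == 1 and t == 0)
            fn += int(p == 0 and t == 1)
    num_entries = len(targets) * len(targets[0])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "exact_match": exact_matches / len(targets),
        "hamming_accuracy": correct_entries / num_entries,
        "micro_precision": precision,
        "micro_recall": recall,
        "micro_f1": f1,
    }

probabilities = [[0.9, 0.2, 0.7], [0.1, 0.6, 0.4]]   # sigmoid outputs
targets = [[1, 0, 1], [0, 1, 1]]
metrics = multilabel_metrics(probabilities, targets)
print(metrics)
```

Exact match is the strictest of these, since a sample only counts when all of its labels are predicted correctly; with rare positives, Hamming accuracy can look high even for a model that predicts all zeros, so precision and recall are usually worth reporting alongside it.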