Multi Label Classification in pytorch

SpandanMadan · August 8, 2017, 5:50am

Hi Soumith,

I see that a bunch of people feel multi label classification is important and don’t have the details figured out. I can build an example based off of the code I wrote for my research. Is the standard way to fork the git repo and request merge?

Best,
Spandan

AjayTalati · August 19, 2017, 8:40am

Hi Spandan,

That would be a great help to the community. Good working examples are always warmly appreciated.

Best,
Ajay

ahkarami · October 17, 2017, 10:29am

Dear @mratsim
I have an extremely large-scale multi-label data set (about 12M images and 11K labels). Could you please advise on the best way to represent each sample with its corresponding labels (with the best multi-GPU utilization and data-loading efficiency)?
Thank you

mratsim · October 17, 2017, 5:11pm

Hey @ahkarami, I’m sorry, I have never processed data at such a scale (yet), and without playing with the data and your IT architecture I would have trouble helping you there.

Here is how I would go about it:

Get as much RAM as you can, get SSDs as well.
Load the data on the fly with multiple workers so that the CPU can feed your data as fast as the GPUs process it.
Have a look into PyTorch Distributed: http://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html
If data storage or storage of numpy arrays is an issue after preprocessing, look into bcolz for in-memory or on-disk compressed numpy-compatible arrays. I wrote an article on that here, but I only had 160 GB of images to process.
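The on-the-fly loading advice above can be sketched with torch.utils.data.DataLoader. This is only a minimal sketch under stated assumptions: RandomMultiLabelDataset and make_loader are hypothetical names, the data is random, and num_workers is left at 0 so the snippet runs anywhere. On real data you would presumably raise num_workers until the GPUs stop waiting for batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class RandomMultiLabelDataset(Dataset):
    """Stand-in dataset (hypothetical): random 'images' and multi-hot labels."""

    def __init__(self, n_samples=32, n_labels=10):
        self.n_samples = n_samples
        self.n_labels = n_labels

    def __len__(self):
        return self.n_samples

    def __getitem__(self, index):
        # Each sample is generated only when requested, like a file read would be
        g = torch.Generator().manual_seed(index)
        image = torch.rand(3, 8, 8, generator=g)
        label = (torch.rand(self.n_labels, generator=g) > 0.8).float()
        return image, label


def make_loader(dataset, batch_size=8, num_workers=0):
    # num_workers > 0 loads batches in background processes;
    # pin_memory speeds up host-to-GPU copies when CUDA is available
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers,
                      pin_memory=torch.cuda.is_available())


loader = make_loader(RandomMultiLabelDataset())
images, labels = next(iter(loader))
print(images.shape, labels.shape)
```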

For the multi-GPU side, you will probably have to summon one of the PyTorch core devs.

SpandanMadan · October 17, 2017, 9:19pm

One way to do this is to not load everything into the dataloader, and just write one which assembles the labels on the fly. Say your GPU would handle something like 150 images in one go. So it needs 150 vectors of length 11K in one go, as each image’s label can be binarized [1,0,0,0,1…] (1 if the image has that label and 0 if it doesn’t.)

First, create a dictionary mapping each image name to its label vector and store it on disk using Python pickle. Let’s call this pickle file ‘image_name_to_label_vector.pckl’.
Now, you can create a new data loader like this. All I’ve changed from the original data loader is the __getitem__ function, where I’m loading the labels on the fly from this dictionary. Simple!
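At the scale mentioned earlier in the thread (about 12M images and 11K labels), dense vectors for every image would take roughly 132 GB even at one byte per entry, so one possible variation is to pickle only the indices of the positive labels and expand them when a sample is requested. The sketch below is that variation, not the code used in this post; the file names, label indices, and the to_multi_hot helper are all hypothetical.

```python
import pickle

import numpy as np

NUM_LABELS = 11000  # roughly the label count mentioned in the thread

# Hypothetical annotations: image file name -> indices of the labels it carries
image_name_to_label_indices = {
    "img_0001.jpg": [0, 4, 10999],
    "img_0002.jpg": [17],
    "img_0003.jpg": [],  # an image may carry no label at all
}


def to_multi_hot(indices, num_labels=NUM_LABELS):
    """Expand a list of label indices into a dense 0/1 float32 vector."""
    vec = np.zeros(num_labels, dtype=np.float32)
    vec[np.asarray(indices, dtype=np.int64)] = 1.0
    return vec


# Serialize the compact mapping (pickle.dump to a file works the same way)
blob = pickle.dumps(image_name_to_label_indices,
                    protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)

# Inside __getitem__, the dense vector would be built only for the sample asked for
label = to_multi_hot(restored["img_0001.jpg"])
print(label.shape, int(label.sum()))
```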

IN YOUR PYTORCH FILE, add the new data loader -

from torchvision import datasets, models, transforms
import matplotlib.pyplot as plt
from ImageFolder_new import ImageFolder_spandan

DATA LOADER (save as ImageFolder_new.py) -

import torch.utils.data as data
import pickle
import numpy as np
from PIL import Image
import os
import os.path
import torch

IMG_EXTENSIONS = [
    '.jpg', '.JPG', '.jpeg', '.JPEG',
    '.png', '.PNG', '.ppm', '.PPM', '.bmp', '.BMP',
]

# Load the image-name -> label-vector mapping once, at import time
with open('image_name_to_label_vector.pckl', 'rb') as f:
    image_name_to_label_vector = pickle.load(f)

def is_image_file(filename):
    return any(filename.endswith(extension) for extension in IMG_EXTENSIONS)


def find_classes(dir):
    classes = [d for d in os.listdir(dir) if os.path.isdir(os.path.join(dir, d))]
    classes.sort()
    class_to_idx = {classes[i]: i for i in range(len(classes))}
    return classes, class_to_idx


def make_dataset(dir, class_to_idx):
    images = []
    dir = os.path.expanduser(dir)
    for target in sorted(os.listdir(dir)):
        d = os.path.join(dir, target)
        if not os.path.isdir(d):
            continue

        for root, _, fnames in sorted(os.walk(d)):
            for fname in sorted(fnames):
                if is_image_file(fname):
                    path = os.path.join(root, fname)
                    item = (path, class_to_idx[target])
                    images.append(item)

    return images


def pil_loader(path):
    # open path as file to avoid ResourceWarning (https://github.com/python-pillow/Pillow/issues/835)
    with open(path, 'rb') as f:
        with Image.open(f) as img:
            image_converted = img.convert('RGB')
            return image_converted


def accimage_loader(path):
    import accimage
    try:
        return accimage.Image(path)
    except IOError:
        # Potentially a decoding problem, fall back to PIL.Image
        return pil_loader(path)


def default_loader(path):
    from torchvision import get_image_backend
    if get_image_backend() == 'accimage':
        return accimage_loader(path)
    else:
        return pil_loader(path)


class ImageFolder_spandan(data.Dataset):
    """A generic data loader where the images are arranged in this way: ::
        root/dog/xxx.png
        root/dog/xxy.png
        root/dog/xxz.png
        root/cat/123.png
        root/cat/nsdf3.png
        root/cat/asd932_.png
    Args:
        root (string): Root directory path.
        transform (callable, optional): A function/transform that  takes in an PIL image
            and returns a transformed version. E.g, ``transforms.RandomCrop``
        target_transform (callable, optional): A function/transform that takes in the
            target and transforms it.
        loader (callable, optional): A function to load an image given its path.
     Attributes:
        classes (list): List of the class names.
        class_to_idx (dict): Dict with items (class_name, class_index).
        imgs (list): List of (image path, class_index) tuples
    """

    def __init__(self, root, transform=None, target_transform=None,
                 loader=default_loader):
        classes, class_to_idx = find_classes(root)
        imgs = make_dataset(root, class_to_idx)
        if len(imgs) == 0:
            raise(RuntimeError("Found 0 images in subfolders of: " + root + "\n"
                               "Supported image extensions are: " + ",".join(IMG_EXTENSIONS)))

        self.root = root
        self.imgs = imgs
        self.classes = classes
        self.class_to_idx = class_to_idx
        self.transform = transform
        self.target_transform = target_transform
        self.loader = loader

    def __getitem__(self, index):
        """
        Args:
            index (int): Index
        Returns:
            tuple: (image, label) where label is the multi-label vector stored
                for this image in image_name_to_label_vector.
        """

        # The folder-derived class index is not used for multi-label training
        path, _ = self.imgs[index]
        img = self.loader(path)
        if self.transform is not None:
            img = self.transform(img)
        # Look the label vector up by file name (works with any path separator)
        name = os.path.basename(path)
        label = image_name_to_label_vector[name]
        if self.target_transform is not None:
            label = self.target_transform(label)
        return img, label
        
    def __len__(self):
        return len(self.imgs)

That handles your data loading without anything too fancy. If you have the resources to parallelise this, feel free to use DataParallel!

ahkarami · October 18, 2017, 6:30am

Dear @mratsim & @SpandanMadan,
I really appreciate the useful hints and guidance. I will use the presented methods.
Thank you very much

ahkarami · October 27, 2017, 11:24am

Dear @mratsim & @SpandanMadan,
I have another question. One well-known multi-label classification method is the sigmoid cross-entropy loss (for example, adding an F.sigmoid() layer at the end of the CNN model and then using nn.BCELoss()). My question is: is it better to include the F.sigmoid() layer at the end of the CNN model during training, or to leave it out during training and apply F.sigmoid() to the network output only after the parameters have been learned? (i.e., should the sigmoid layer be part of the network structure or not?)

mratsim · October 28, 2017, 9:24am

Let’s take ResNet finetuning as an example:

import torch.nn as nn
from torch.nn.init import kaiming_normal_
from torchvision import models

class ResNet50(nn.Module):
    def __init__(self, num_classes):
        super(ResNet50, self).__init__()
        
        # Loading ResNet arch from PyTorch
        original_model = models.resnet50(pretrained=True)
        
        # Everything except the last linear layer
        self.features = nn.Sequential(*list(original_model.children())[:-1])
        
        # Get number of features of last layer
        num_feats = original_model.fc.in_features
        
        # Plug our classifier
        self.classifier = nn.Sequential(
            nn.Linear(num_feats, num_classes)
        )
        
        # Init of last layer (in-place initializer from torch.nn.init)
        for m in self.classifier:
            kaiming_normal_(m.weight)

        # Freeze all weights except the last classifier layer
        # for p in self.features.parameters():
        #     p.requires_grad = False

    def forward(self, x):
        f = self.features(x)
        f = f.view(f.size(0), -1)
        y = self.classifier(f)
        return y

Is your question about using sigmoid here?

    def forward(self, x):
        f = self.features(x)
        f = f.view(f.size(0), -1)
        y = self.classifier(f)
        y = F.sigmoid(y) # Is this better ?
        return y

Or one level higher?
Currently there is no difference.

Ideally, in the future you should use MultiLabelSoftMarginLoss during training, once it is numerically stable and faster; see PyTorch issue 1516.

Currently MultiLabelSoftMarginLoss in PyTorch is implemented the naive way, with sigmoid and cross-entropy as separate passes; if the two were fused, it would be faster and more accurate.

The proper way is to use the log-sum-exp trick to simplify the Sigmoid Cross Entropy (SCE) expression from this (after naively substituting the sigmoid into the cross-entropy function):

SCE(x, y') = − 1/n ∑i ( ti * (xi - ln(1 + e^xi)) + (1−ti) * (-ln(1 + e^xi)) )
ti (read target_i) being the elements of y’

to this:

SCE(x, y') = − 1/n ∑i ( ti * xi - max(xi,0) - ln(1 + e^-|xi|) )

This is more numerically stable and much faster to compute.
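The two forms of the expression can be compared directly. This is a rough sketch, not the framework code referred to in this post: naive_sce and stable_sce are hypothetical helper names, and the comparison against F.binary_cross_entropy_with_logits and the use of torch.log1p assume a current PyTorch release.

```python
import torch
import torch.nn.functional as F


def naive_sce(x, t):
    """Sigmoid substituted directly into cross-entropy (the first formula)."""
    softplus = torch.log(1 + torch.exp(x))  # exp(x) overflows for large x
    return -(t * (x - softplus) + (1 - t) * (-softplus)).mean()


def stable_sce(x, t):
    """Simplified form: -(1/n) * sum(t*x - max(x, 0) - ln(1 + e^-|x|))."""
    return -(t * x - x.clamp(min=0)
             - torch.log1p(torch.exp(-x.abs()))).mean()


# Logits include extreme values on purpose to expose the overflow
x = torch.tensor([[-2.0, 0.5, 3.0], [1000.0, -1000.0, 0.0]])
t = torch.tensor([[0.0, 1.0, 1.0], [1.0, 0.0, 0.0]])

print(naive_sce(x, t))   # not finite, because exp(1000) overflows
print(stable_sce(x, t))  # finite
print(F.binary_cross_entropy_with_logits(x, t))  # built-in fused loss
```

The stable version never exponentiates a positive number, which is why it stays finite for any logit.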

A full explanation of each simplification step is in my own PyTorch-like framework here.

Note: ln(1 + x) is also numerically unstable if x << 1 (much smaller than 1): 1 + x is rounded to 1 and ln(1) gives 0 (the small term is lost to rounding), even though ln(1 + x) ~= x when x is small. This means the network can wrongly stop training because there is no gradient. Numpy has the log1p function to avoid that, but I don’t think PyTorch has it.
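The rounding problem is easy to reproduce. A small sketch follows, assuming a current PyTorch release, where torch.log1p is available alongside numpy's log1p; the value 1e-20 is just an illustrative input.

```python
import numpy as np
import torch

x = 1e-20

# 1 + 1e-20 rounds to exactly 1.0 in float64, so the logarithm collapses to 0
plain = np.log(1 + x)

# log1p evaluates ln(1 + x) without forming 1 + x first
accurate = np.log1p(x)
torch_accurate = torch.log1p(torch.tensor(x, dtype=torch.float64))

print(plain, accurate, torch_accurate.item())
```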

ahkarami · October 28, 2017, 8:52pm

Dear @mratsim,
Thank you very much for your complete and useful response. I have found that if I use the Sigmoid layer in the network (i.e., your second script in the above message), the BCE loss is the same as MultiLabelSoftMarginLoss.
Another thing: I was also impressed by your fantastic Arraymancer library. Can we use it in PyTorch? I mean, how can we use the simplified Sigmoid Cross Entropy (SCE) loss in PyTorch?
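The observation that sigmoid plus BCELoss matches MultiLabelSoftMarginLoss can be checked numerically. The sketch below uses random logits and targets and assumes a current PyTorch release, where nn.BCEWithLogitsLoss is also available as a fused sigmoid-plus-BCE loss; with default mean reduction all three should agree up to floating-point error.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 5)                  # raw network outputs
targets = (torch.rand(4, 5) > 0.5).float()  # multi-hot labels

# Sigmoid applied explicitly, then plain BCE
bce_on_probs = nn.BCELoss()(torch.sigmoid(logits), targets)
# Sigmoid folded into the loss
bce_on_logits = nn.BCEWithLogitsLoss()(logits, targets)
# Multi-label soft margin loss on the same logits
soft_margin = nn.MultiLabelSoftMarginLoss()(logits, targets)

print(bce_on_probs.item(), bce_on_logits.item(), soft_margin.item())
```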

mratsim · November 5, 2017, 4:01pm

For now there are no Python bindings so we can’t.

Dimitrisl · November 6, 2017, 11:47am

Hello! This would be a life-saver for me as well… Could you show us an example of how you pass the multiple labels to the loss function?
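One way to pass multiple labels to the loss is as a float tensor of 0s and 1s with the same shape as the model output. The following is a minimal sketch with a toy linear model and random data standing in for a real network and dataset; the sizes and the choice of nn.BCEWithLogitsLoss are assumptions for illustration, not something confirmed in the thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_features, num_labels, batch_size = 8, 5, 16

model = nn.Linear(num_features, num_labels)  # stand-in for a real network
criterion = nn.BCEWithLogitsLoss()           # expects raw logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(batch_size, num_features)
# One row per sample, one column per label; several columns may be 1 at once
targets = (torch.rand(batch_size, num_labels) > 0.7).float()

losses = []
for step in range(20):
    optimizer.zero_grad()
    logits = model(inputs)             # shape: (batch_size, num_labels)
    loss = criterion(logits, targets)  # targets have the same shape as logits
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```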

Thanks in advance!

Luoshang_Lowson_Pan · November 27, 2017, 4:47pm

I find it confusing to use Sigmoid(output) > 0.5 as the labeling criterion.
For the sigmoid function, when output == 0, Sigmoid(output) == 0.5.
And in your example, all the labels are either 0 or 1.
Thus, the outputs for negative samples will converge to 0, within a range [0 - a, 0 + a].
And using Sigmoid(output) > 0.5 will label every negative sample in (0, 0 + a] as positive.
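It may help to separate the raw output (the logit) from the probability. With a binary cross-entropy loss, a target of 0 pushes Sigmoid(output) toward 0, which means the raw output is pushed toward large negative values rather than toward 0. Thresholding the probability at 0.5 is then the same as thresholding the logit at 0, apart from rounding very close to 0. The sketch below illustrates this; the predict helper and its adjustable threshold are hypothetical.

```python
import torch

logits = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
probs = torch.sigmoid(logits)

by_prob = probs > 0.5  # threshold on probabilities
by_logit = logits > 0  # equivalent threshold on raw outputs


def predict(logits, threshold=0.5):
    """Turn logits into 0/1 decisions; the threshold can be tuned per task."""
    return (torch.sigmoid(logits) > threshold).int()


print(by_prob.tolist(), by_logit.tolist())
print(predict(logits, threshold=0.7).tolist())
```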

Luckick · December 21, 2017, 4:26pm

I did a similar project with 12 labels to classify. Each observation could belong to multiple labels or none.
In my case, positive cases are very rare: in the labels, there are 49 times as many 0s as 1s.
I used BCELoss with weights, setting the weight for the positive case to 49 times the weight for the negative case. In the end I got 0.85 AUC on average.
Just treat the multi-label problem as unbalanced data.
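A related way to up-weight positives, which is a different mechanism from the BCELoss weight described in this post, is the pos_weight argument of nn.BCEWithLogitsLoss in current PyTorch releases. The sketch below assumes 12 labels and the 49:1 ratio from the post, with random logits and targets, and checks the result against the formula written out by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_labels = 12
logits = torch.randn(6, num_labels)
targets = (torch.rand(6, num_labels) > 0.9).float()  # mostly zeros

# One weight per label, applied only to the positive term of the loss
pos_weight = torch.full((num_labels,), 49.0)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = criterion(logits, targets)

# The same quantity written out by hand
manual = -(pos_weight * targets * F.logsigmoid(logits)
           + (1 - targets) * F.logsigmoid(-logits)).mean()

print(loss.item(), manual.item())
```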

Yongfei_Liu · December 22, 2017, 2:42am

Hello, SpandanMadan! Recently I have had some trouble with multi-label classification tasks. Could you share your example classification repo? Thanks

jlro · February 27, 2018, 12:43pm

I chose MultiLabelMarginLoss as the loss function, but during training the output changed oddly.
The values in the first column become much larger than 1, while the values in the other columns become much smaller than 1.

My core code is as follows:

criterion = nn.MultiLabelMarginLoss()
optimizer = optim.SGD(mynet.parameters(), lr=5e-3)

Do you know why?
Or could you show me a correct demo?
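Whether this explains the behaviour above cannot be told from two lines of code, but one thing worth checking is the target format. nn.MultiLabelMarginLoss does not take 0/1 vectors: for each sample it expects the indices of the positive classes as a LongTensor, padded with -1 up to the number of classes. It is also a margin loss on raw scores, so outputs are not confined to the range 0 to 1. The sketch below converts multi-hot labels to that format; multi_hot_to_margin_target is a hypothetical helper and the scores are made up.

```python
import torch
import torch.nn as nn


def multi_hot_to_margin_target(multi_hot):
    """Convert (N, C) 0/1 labels to class indices padded with -1."""
    n, c = multi_hot.shape
    target = torch.full((n, c), -1, dtype=torch.long)
    for row in range(n):
        idx = torch.nonzero(multi_hot[row]).flatten()
        target[row, :idx.numel()] = idx
    return target


scores = torch.tensor([[0.1, 0.2, 0.4, 0.8]])     # raw model outputs
multi_hot = torch.tensor([[1.0, 0.0, 0.0, 1.0]])  # classes 0 and 3 are positive

target = multi_hot_to_margin_target(multi_hot)    # [[0, 3, -1, -1]]
loss = nn.MultiLabelMarginLoss()(scores, target)
print(target.tolist(), loss.item())
```

If the 0/1 vector were passed as the target directly, its entries would be read as class indices, which silently trains toward the wrong classes.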

Rohit_Kumar_Singh · July 4, 2018, 2:57pm

Hi Luckick,

How did you assign weights to the positive (1) and negative (0) cases? In BCELoss() we can assign weights to each label, right? Please correct me if I am wrong.

Luckick · July 4, 2018, 5:24pm

There is a parameter in BCELoss(): weight (Tensor, optional) – a manual rescaling weight given to the loss of each batch element. If given, has to be a Tensor of size “nbatch”.
So it is not assigned to each label, but you can set it for each input according to the proportion of labels.

Rohit_Kumar_Singh · July 4, 2018, 5:51pm

So you pass a weight tensor of the same shape as the target, with only two unique values in it? And those two values, let’s say w1 and w2, will be at the same positions in the weight tensor as the 1s and 0s respectively are in the target tensor?

Luckick · July 5, 2018, 12:44am

Yes, you are right…
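The weight tensor described in this exchange can be sketched as follows. The values w1 = 49 and w2 = 1 and the random probabilities are assumptions for illustration; the functional form F.binary_cross_entropy is used so the weight can be rebuilt for every batch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
probs = torch.sigmoid(torch.randn(4, 3))  # model outputs after the sigmoid
targets = torch.tensor([[1.0, 0.0, 0.0],
                        [0.0, 0.0, 1.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 0.0]])

w1, w2 = 49.0, 1.0  # weight for positives and for negatives (illustrative)
# Same shape as the target: w1 where the target is 1, w2 where it is 0
weight = torch.where(targets == 1, torch.tensor(w1), torch.tensor(w2))

loss = F.binary_cross_entropy(probs, targets, weight=weight)

# The same weighted loss written out by hand
manual = -(weight * (targets * probs.log()
                     + (1 - targets) * (1 - probs).log())).mean()
print(loss.item(), manual.item())
```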

lyan62 · July 26, 2018, 2:46pm

Can anyone share how they calculate accuracy or evaluate performance for multi-label classification?
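A few commonly used multi-label measures can be computed directly from 0/1 arrays. This is a minimal sketch on made-up predictions, not a recommendation from the thread: exact-match accuracy counts a sample only if every label is right, Hamming accuracy scores each label decision separately, and micro-averaged F1 pools true and false positives over all labels.

```python
import numpy as np

# Rows are samples, columns are labels (made-up values for illustration)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 1],
                   [0, 0, 0]])

# Fraction of samples whose whole label vector is predicted correctly
exact_match = np.mean(np.all(y_true == y_pred, axis=1))

# Fraction of individual label decisions that are correct
hamming_accuracy = np.mean(y_true == y_pred)

# Micro-averaged precision, recall and F1 over all label decisions
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
micro_f1 = 2 * precision * recall / (precision + recall)

print(exact_match, hamming_accuracy, micro_f1)
```

When positives are rare, Hamming accuracy can look high even for a model that predicts all zeros, so F1 or per-label AUC is often more informative.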