Which loss function is better to use for multilabel classification?

I am currently working on a mini-project where I predict movie genres based on their posters. In my dataset, each movie can have from 1 to 3 genres, so each instance can belong to multiple classes. I have 15 classes in total (15 genres) and use a mini-batch size of 4. When I train my classifier, each label is a list of 3 elements and looks like this:

tensor([[ 2., 10.,  5.],
        [ 2.,  5.,  0.],
        [14.,  0.,  0.],
        [ 1.,  0.,  0.]])

where 0 means there is no genre available for that position,

and my output at the last stage is

tensor([[-0.0968, -0.0381, -0.0629, -0.0519,  0.1343, -0.0395,  0.0480, -0.0035,
          0.0559, -0.0791,  0.0652,  0.0573, -0.0751,  0.0459, -0.0035],
        [-0.0978, -0.0385, -0.0551, -0.0518,  0.1312, -0.0432,  0.0539,  0.0017,
          0.0460, -0.0868,  0.0627,  0.0534, -0.0666,  0.0420,  0.0013],
        [-0.0939, -0.0549, -0.0444, -0.0664,  0.1229, -0.0561,  0.0458,  0.0021,
          0.0328, -0.0869,  0.0710,  0.0462, -0.0734,  0.0459,  0.0065],
        [-0.0916, -0.0274, -0.0734, -0.0436,  0.1443, -0.0329,  0.0525, -0.0043,
          0.0679, -0.0738,  0.0639,  0.0557, -0.0754,  0.0459, -0.0087]])

My total number of genres is 15, so my last fully connected layer gives me an output of 15 values. But now the problem is that I don't know which loss function to choose here so that it properly computes the loss for my problem. I tried CrossEntropyLoss, but it does not work since it does not support multilabel problems:

multi-target not supported at c:\new-builder_3\win-wheel\pytorch\aten\src\thnn\generic/ClassNLLCriterion.c:21

I also tried nn.MultiLabelSoftMarginLoss(), but here the problem is that the number of elements in my output and target do not match (60 != 12)… So I am wondering what the best loss function would be in this case and how to implement it…
Currently, the training part of my code looks like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 10, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(10, 20, 5)
        self.fc1 = nn.Linear(20 * 22 * 39, 100)
        self.fc2 = nn.Linear(100, 50)
        self.fc3 = nn.Linear(50, 10)
        self.fc4 = nn.Linear(10, 3)

    def forward(self, x):
        x = x.view(-1, 3, 100, 170)
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 20 * 22 * 39)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x

net = Net()

import torch.optim as optim

criterion = nn.MultiLabelSoftMarginLoss()  # also tried nn.CrossEntropyLoss() and nn.BCEWithLogitsLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Train the network

for epoch in range(4):  # loop over the dataset multiple times

    losses = []

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data
        inputs = inputs.float()
        labels = labels.float()

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 200 == 199:    # print every 200 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 200))
            running_loss = 0.0

print('Finished Training')


You could try to transform your target to a multi-hot encoded tensor, i.e. each active class has a 1 while inactive classes have a 0, and use nn.BCEWithLogitsLoss as your criterion.
Your target would thus have the same shape as your model output.
This worked pretty well in the past for me.
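A minimal sketch of the idea (the class indices and shapes here are made up for illustration):

import torch
import torch.nn as nn

num_classes = 15
batch_indices = [[2, 10, 5], [14]]  # hypothetical per-sample genre indices

# multi-hot targets: 1 for each active class, 0 otherwise
targets = torch.zeros(len(batch_indices), num_classes)
for row, indices in enumerate(batch_indices):
    targets[row, indices] = 1.

logits = torch.randn(len(batch_indices), num_classes)  # stand-in for the model output
loss = nn.BCEWithLogitsLoss()(logits, targets)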


Hi @ptrblck, thanks for taking a look at my problem. Could you provide an example of how to do this transformation?
I also reduced the number of my labels since my last question due to high imbalance in my target data. So now, each label has 6 genres and looks as follows:
tensor([1, 4, 1, 0, 5, 2])
and the output of my model looks like that :
tensor([[-0.0372, -0.0156, -0.0152,  0.0168, -0.0080,  0.0074],
        [-0.0337, -0.0016, -0.0026, -0.0089, -0.0027,  0.0187]])

I am not sure how to make some classes active while others are inactive, could you give a hint or provide an example? I know that nn.BCEWithLogitsLoss has to be followed by a sigmoid as the activation function, but I am not sure what the best way to use it is in my case.


Sure!
Given your example target, you could use scatter to create the multi-hot target:

labels = torch.tensor([1, 4, 1, 0, 5, 2])
labels = labels.unsqueeze(0)
target = torch.zeros(labels.size(0), 15).scatter_(1, labels, 1.)
print(target)
> tensor([[1., 1., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

nn.BCEWithLogitsLoss takes the raw logits of your model (without any non-linearity) and applies the sigmoid internally.
If you would like to add the sigmoid activation to your model, you should use nn.BCELoss instead.
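To illustrate the equivalence, a quick check with random values (the shapes are arbitrary here):

import torch
import torch.nn as nn

logits = torch.randn(4, 15)
target = torch.randint(0, 2, (4, 15)).float()

loss_a = nn.BCEWithLogitsLoss()(logits, target)       # sigmoid applied internally
loss_b = nn.BCELoss()(torch.sigmoid(logits), target)  # sigmoid applied manually
print(torch.allclose(loss_a, loss_b))  # True (up to floating point precision)

Note that nn.BCEWithLogitsLoss is also more numerically stable than a separate sigmoid followed by nn.BCELoss, so it is usually the preferred option.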


Could you explain this line of your code in a little more detail:
target = torch.zeros(labels.size(0), 15).scatter_(1, labels, 1.)
Why did you choose dimension 1 as the first argument of the scatter function, and where did the number 15 come from?
Also, I don't need to change the format of my model outputs, right?
And how do I measure the loss during my training loop? Would the following be a good way to do that?


losses = []
loss = loss_fn(prediction, to_variable(target))  # compute the loss
loss.backward()   # backpropagate the gradients
losses.append(loss.data.cpu().numpy())
optim.step()      # update the network
print("Epoch {} Loss: {:.4f}".format(epoch, np.asscalar(np.mean(losses))))

I used 15 for dim1 since you were dealing with 15 classes (genres), as far as I understood it.
Sure, let's see what this line of code is actually doing.
torch.zeros(labels.size(0), 15) initializes a new tensor of all zeros with the shape [batch_size, 15]. This should be the same shape as your model output. For dim1 I'm using the number of classes, so let's call this dimension the "class dimension".
.scatter_ is an inplace method which uses an index tensor (labels in this case) to fill the positions given by labels with a certain value along a specified dimension.
I'm using dim=1, since I would like to use the indices passed in labels ([1, 4, 1, 0, 5, 2]) to index dim1 (the "class dimension").
Then I'm setting src=1. to fill all specified positions with the value 1.
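In other words, for this 2-dimensional case the scatter_ call behaves roughly like this explicit loop (a sketch for illustration):

import torch

labels = torch.tensor([[1, 4, 1, 0, 5, 2]])  # shape [1, 6] after unsqueeze(0)
target = torch.zeros(labels.size(0), 15)
for row in range(labels.size(0)):
    for idx in labels[row]:
        target[row, idx] = 1.  # write 1. at each given class index
print(target)  # same result as torch.zeros(1, 15).scatter_(1, labels, 1.)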

Does this explanation make sense to you? Let me know, if you need some further examples or explanations.

Yes, the loop looks alright. You should use item() instead of .data, but besides that it looks good!
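For example:

losses.append(loss.item())  # item() returns the loss as a plain Python number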


Oh ok, I was a little confused by the 15 because I have now reduced my classes to 6, but I think I understand the function now. I ran your code for the case of 6 classes and get the output tensor([[1., 1., 1., 0., 1., 1.]]); however, in my case 0 still means a class. Do I need to change it in the dictionary of my labels?
If I do, I will then get [[1., 1., 1., 1., 1., 1.]]
for labels = torch.tensor([2, 5, 2, 1, 6, 3]),
and it will be true for all of my labels, since every one of them has 6 different values and thus all classes will be activated. Will the loss function be accurate in this case then?

My batch size is 4, so following your example my code is as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 22 * 39, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 15)

    def forward(self, x):
        # my inputs have size [170, 100, 3], so I reshape them here to match the model's expected input
        # (note: view() reinterprets memory rather than swapping axes; permute() would do a true swap)
        x = x.view(-1, 3, 100, 170)
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 22 * 39)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()
net.to(device)

import torch.optim as optim

criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

print(len(trainloader))
for epoch in range(4):  # loop over the dataset multiple times

    losses = []
    
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data
        labels = labels.unsqueeze(-1)
        targets = torch.zeros(labels.size(0),15).scatter_(1, labels, 1.)
        targets = targets.squeeze(0)
        targets = targets.float()
        inputs, targets = inputs.to(device), targets.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        losses.append(loss.data.cpu().numpy())
        print("Epoch {} Loss: {:.4f}".format(epoch, np.asscalar(np.mean(losses))))

print('Finished Training')


dataiter = iter(testloader)
images, labels = dataiter.next()



net.to(device)
images, labels = images.to(device), labels.to(device)
outputs = net(images)

with torch.no_grad():
    for data in testloader:
        images, labels = data
        images, labels = images.to(device), labels.to(device)
        y = net(images)
        print('Y (logits): {}'.format(y.data.cpu().numpy()))
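        # note: (logits > 0) is equivalent to sigmoid(logits) > 0.5, i.e. a 0.5 probability threshold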
        print('Y (argmax): {}'.format(y.data.cpu().numpy() > 0))

but the output that I am getting is:

Y (logits): [[-1.631788  -1.1816276 -2.5093026 -3.4875336 -2.4541166 -3.1875174
  -2.6211216 -3.6854322 -2.8854806 -3.81516   -3.0067604 -3.414101
  -3.5943007 -3.71672   -3.5693698]
 [-1.7025979 -1.2310716 -2.612976  -3.6319766 -2.5535479 -3.3250632
  -2.7306297 -3.8352551 -3.0070634 -3.9727454 -3.1343684 -3.5583117
  -3.7449632 -3.8727129 -3.7125916]
 [-1.6599652 -1.1989014 -2.5494416 -3.540475  -2.487864  -3.2383528
  -2.6703634 -3.7370138 -2.932359  -3.877006  -3.0570838 -3.470197
  -3.6498802 -3.7754254 -3.6216202]
 [-1.4784788 -1.0684851 -2.273665  -3.1516986 -2.2127426 -2.8756993
  -2.3788373 -3.3367095 -2.598027  -3.450463  -2.7082787 -3.08899
  -3.2494898 -3.3505046 -3.230426 ]]
Y (argmax): [[False False False False False False False False False False False False
  False False False]
 [False False False False False False False False False False False False
  False False False]
 [False False False False False False False False False False False False
  False False False]
 [False False False False False False False False False False False False
  False False False]]

even though my loss is 0.2213 on average,
so I am not sure what is wrong in this case… I feel like there are 3 possible reasons for it:

  1. I don't measure accuracy correctly
  2. Something is wrong with my training part (I implemented my loss criterion incorrectly)
  3. Something is wrong with my data (imbalanced?)

If you reduced the number of classes to 6, your model should also output the logits for these 6 classes.
Currently it seems like your model has an output of shape [batch_size, 15].

I'm not sure I understand your labels properly, as I thought the indices point to the classes present in the current sample, e.g. a tensor of [2, 5] would indicate that class2 and class5 are present for the current sample, while all others are not.
Could you explain your labels a bit more, since I think I misunderstood them?

Sorry for the confusion, I tried my code with 15 classes, so let's stick with that assumption for this discussion. My labels are genres that were vectorized into numbers: for example, [Action, Comedy, Romance, Horror] would be [1, 3, 5, 8] in the current label. So you think something is wrong with that part?

No, that’s alright.
I was just wondering about the other example, since there were some repetitions in the target:

labels = torch.tensor([2, 5, 2, 1, 6, 3])

This sample would have class2 “twice”. Is this a typo?
Also, you should use 0-based indices, i.e. your targets should be in the range [0, nb_classes-1].

I'm not sure what this means:

class0 is still a valid class and will be set to 1 if the labels tensor indicates it:

labels = torch.tensor([1, 0, 5])
labels = labels.unsqueeze(0)
target = torch.zeros(labels.size(0), 6).scatter_(1, labels, 1.)
print(target)
> tensor([[1., 1., 0., 0., 0., 1.]])

Oh ok, I think I was misinterpreting the output… so e.g.

labels = torch.tensor([1, 4, 1, 0, 5, 2])
labels = labels.unsqueeze(0)
target = torch.zeros(labels.size(0), 6).scatter_(1, labels, 1.)
print(target)

will give output

tensor([[1., 1., 1., 0., 1., 1.]]), meaning class3 was not found, right?

Regarding the labels, the 6 was a typo, since I was just giving an example with random numbers; I do indeed use 0-based indices in my code. And regarding the 2, it is not a typo, since some training batches may contain more genres from one class than another.

Yes, exactly. Your model should output a high probability for all classes but class3 for this sample.

Sure, a batch may contain multiple samples with the same class, but my code snippet currently works on a single sample. What would the two 2s mean in that case?

Ok, thank you for making it clear.

That means that in this particular case I have [Adventure, Action, Adventure, Romance, Horror, Documentary] in my label tensor, where the Adventure genre appears twice as a ground-truth label.

I think I see what is going on… My batch label should not have duplicates, right?

Well, you might have duplicates in a certain batch, but it is strange to have them for a single sample.
Let’s say you have a batch of two samples with the following labels:

batch[
sample0: [Adventure, Action]
sample1: [Adventure, Action, Romance]
]

This example is perfectly fine. The corresponding target tensors could look like this (depending on the mapping between the genres and the class indices):

[[2, 1],
 [2, 1, 5]]

It's still a bit strange to me to see the same label (Adventure) twice for the same sample.
This would mean the example from above could look like this:

sample0: [Adventure, Adventure, Action] - [2, 2, 1]

Would you like to ignore the duplicates for these samples, or do they have any meaning?

Alright, let me try to make it less confusing by printing out the actual intermediate outputs from my code:

tensor([1, 0, 1, 1]) - original label from trainloader, 
torch.Size([4])  - size of this label
torch.Size([4, 1]) - size after unsqueezing
tensor([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]) - this is what I have after using the scatter function and squeezing it back

tensor([ 4,  2,  6, 10])
torch.Size([4])
tensor([[ 4],
        [ 2],
        [ 6],
        [10]])
torch.Size([4, 1])
tensor([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]) - same output in the next iteration

And no, the duplicates don't mean anything in my problem; I think they appeared as a result of my data preprocessing…

Initially, my labels are a list of lists, e.g.

[[Adventure, Action, Romance], [Horror, Thriller], [Documentary], … , [Adventure, Action]]

I vectorized the data according to a dictionary. Since it was a list of lists, I decided to flatten it to make it easier to work with when it comes to batches… Since the number of my labels after flattening the list was not equal to the number of instances given, I decided to cut it like

y_train = y_train[0 : len(x_train)]

so it would be easier for the DataLoader to split it into batches. And I think that is why I ended up with some duplicates. Do you have any suggestions for how to avoid this problem?

Ah OK, thanks for the clarification.
Well, I think if we can just ignore the duplicates, your code should be fine, since scatter_ will just do its job.
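For example, a duplicated index simply writes the same 1. twice (a quick check, assuming 15 classes):

import torch

with_dup = torch.tensor([[2, 2, 1]])
without_dup = torch.tensor([[2, 1]])

t1 = torch.zeros(1, 15).scatter_(1, with_dup, 1.)
t2 = torch.zeros(1, 15).scatter_(1, without_dup, 1.)
print(torch.equal(t1, t2))  # True, the duplicate has no effect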

But why do I then get so many Falses, and why do all my outputs come out negative? What am I doing wrong? Or is something wrong with the metric?

It also seems like I am losing some values when I use the squeeze/unsqueeze method. E.g. in your example,

labels = torch.tensor([1, 4, 1, 0, 5, 2])

tensor([[1., 1., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

so all 5 unique values are converted to 1s,

while in my case

tensor([ 4,  2,  6, 10])
tensor([[ 4],
        [ 2],
        [ 6],
        [10]])
torch.Size([4, 1])
tensor([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Only one is converted… I feel like the problem is here.

It looks like you are unsqueezing dim1 instead of dim0.
Could you check that and see how the target tensor looks?

Sure. If I do unsqueeze(0), I get the following output:

before unsqueezing
tensor([13, 14,  9,  4])
torch.Size([4])
after unsqueezing
tensor([[13, 14,  9,  4]])
torch.Size([1, 4])

after scattering
tensor(0.)

and the following error:

raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([15])) must be the same as input size (torch.Size([4, 15]))

Actually, that is why I changed my dimension from 0 to 1.

Could you post the code where you are transforming your class indices to this multi-hot encoded format?
I would recommend applying it beforehand to each target sample, or in the __getitem__ of your Dataset, so that your training code gets the already processed targets.
Your criterion (nn.BCEWithLogitsLoss) expects the model output and target to have the same shape, so I guess the error is thrown somewhere inside this criterion.
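A sketch of how that could look (the class name PosterDataset and the label format are assumptions based on this thread):

import torch
from torch.utils.data import Dataset

class PosterDataset(Dataset):  # hypothetical Dataset for this setup
    def __init__(self, images, label_lists, num_classes=15):
        self.images = images            # poster images
        self.label_lists = label_lists  # one list of genre indices per sample
        self.num_classes = num_classes

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        target = torch.zeros(self.num_classes)
        target[self.label_lists[index]] = 1.  # multi-hot encode this sample's genres
        return image, target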

The scatter_ call should work using my code; I'm not sure why you get a scalar output.
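For reference, the two unsqueeze cases could look like this (a sketch assuming 15 classes):

import torch

# a single sample with several class indices -> unsqueeze dim0
labels = torch.tensor([1, 4, 0])
target = torch.zeros(1, 15).scatter_(1, labels.unsqueeze(0), 1.)
print(target.shape)  # torch.Size([1, 15])

# a batch of 4 samples with one class index each -> unsqueeze dim1
labels = torch.tensor([13, 14, 9, 4])
target = torch.zeros(4, 15).scatter_(1, labels.unsqueeze(1), 1.)
print(target.shape)  # torch.Size([4, 15]), matching the model output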