What kind of loss is better to use in multilabel classification?

Do you mean to do it outside the training loop?
Here is how I did it:

for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data
        labels = labels.unsqueeze(0)
        targets = torch.zeros(labels.size(0),15).scatter_(1, labels, 1.)
        targets = targets.squeeze(0)
        targets = targets.float()
        inputs, targets = inputs.to(device), targets.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        losses.append(loss.data.cpu().numpy())
        print("Epoch {} Loss: {:.4f}".format(epoch, np.asscalar(np.mean(losses))))

I did it as you suggested (with squeeze(0)), but if I do it this way, I get an incompatible-shapes error like this:

Target size (torch.Size([1, 15])) must be the same as input size (torch.Size([4, 15]))

My labels tensor has a shape of 4 before I send it to scatter_, and afterwards it becomes 1, while my inputs stay the same… How can I fix that?

By the way, you should amend your title and change multiclass (only one class output out of multiple choices) to multilabel (a variable number of outputs from multiple fixed choices).

I had good success in Kaggle competitions with MultiLabelSoftMarginLoss, which is sigmoid + binary cross-entropy (question: what's the difference with BCEWithLogitsLoss?).

You can check out this PyTorch tutorial kernel (note, it was for PyTorch 0.1) and the full-blown code for the competition.

The tricky part you are missing is at inference time: how do you convert the probabilities into discrete predicted labels?

The naive and classic way to do that is to consider every probability p > 0.5 as a true label and every p < 0.5 as false, and discretise the output that way.
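
For example, a minimal sketch of that thresholding, assuming outputs holds the raw logits from the model:

import torch

# hypothetical raw logits from the model, shape [batch_size, num_labels]
outputs = torch.randn(4, 15)

probs = torch.sigmoid(outputs)   # per-label probabilities in (0, 1)
preds = (probs > 0.5).float()    # multi-hot predictions: 1 = label predicted present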

For the Kaggle competition, I instead had a global optimiser search for the best threshold for each label. This was especially helpful to deal with label imbalance in the training set.
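
The optimiser I actually used is in the competition code linked above, but a rough sketch of a simple per-label grid search (with hypothetical val_probs / val_targets arrays from a validation set) could look like this:

import numpy as np
from sklearn.metrics import f1_score

# hypothetical validation probabilities and multi-hot targets, shape [num_samples, num_labels]
val_probs = np.random.rand(100, 15)
val_targets = (np.random.rand(100, 15) > 0.7).astype(int)

thresholds = np.full(val_probs.shape[1], 0.5)
for label in range(val_probs.shape[1]):
    best_t, best_f1 = 0.5, 0.0
    for t in np.linspace(0.05, 0.95, 19):  # coarse grid of candidate thresholds
        f1 = f1_score(val_targets[:, label], (val_probs[:, label] > t).astype(int))
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    thresholds[label] = best_t

preds = (val_probs > thresholds).astype(int)  # apply the per-label thresholds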

Lastly, while you are more or less forced to use BCE loss or variations of it for training, you might ultimately want to evaluate your model with a score that takes false negatives and false positives into account and penalises them according to your precision/recall priorities; examples are the F-beta scores (F1, F0.5, F2, …), AUC/ROC, and the Matthews correlation coefficient.
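
As a quick sketch, these scores can be computed with scikit-learn on hypothetical multi-hot preds / targets arrays:

import numpy as np
from sklearn.metrics import f1_score, fbeta_score, matthews_corrcoef

# hypothetical multi-hot predictions and ground truth, shape [num_samples, num_labels]
preds = (np.random.rand(100, 15) > 0.5).astype(int)
targets = (np.random.rand(100, 15) > 0.7).astype(int)

print(f1_score(targets, preds, average='macro'))             # F1 averaged over labels
print(fbeta_score(targets, preds, beta=2, average='macro'))  # F2 weights recall higher
print(matthews_corrcoef(targets.ravel(), preds.ravel()))     # MCC over the flattened labels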


Hi @mratsim, thanks for the correction and for your suggestions. In my case I am struggling with reshaping my labels so that they are compatible with the size of my inputs; the system gives me an error no matter what I do. I am using nn.BCEWithLogitsLoss with multi-hot encoding as @ptrblck suggested earlier, but I am running into a lot of issues, such as reshaping my labels after multi-hot encoding so that their size is compatible with the size of my inputs. I can't find the right solution, so I would appreciate any suggestion, since without it the accuracy metric I use just reports everything as false (you can see it in the example output I posted earlier in this topic). I also consider all probabilities p > 0.5 as true labels and all p < 0.5 as false, but all my probabilities turned out to be negative… I am really stuck at this point, so any help is much appreciated.

I thought your labels would have variable sizes, so that I would transform them in __getitem__ or even before it, but apparently you are able to feed a whole batch of labels.
In that case, your labels should already be two-dimensional, so that we don’t need the unsqueeze.
In my example code I was using your sample labels tensor, which only had one dimension.
Could you check the shape of labels just after getting it from data?

My labels did vary in size initially, but since they were a list of lists, I decided to flatten them to make them easier to work with when it comes to batches… Since the number of labels after flattening was not equal to the number of instances, I decided to cut the list like

y_train = y_train[0:len(x_train)]

so it would be easier for the DataLoader to split it into batches.

By data, do you mean after I have loaded it into the DataLoader? If so, then the shape is
torch.Size([4]) when the mini_batch = 4.

For a batch size of 4, your labels would thus only contain a single scalar for each sample in the batch.
Could you print one example of these labels?

Sure, here is an example:

for index, data in enumerate(trainloader, 0):  
    inputs, labels = data
    
    print(labels)
    print(labels.size())


tensor([3, 2, 7, 4])
torch.Size([4])

Thanks for the info!
I thought each sample should have a labels tensor with 6 entries for the genres?
Currently each sample has just one class index.

E.g. I thought this would be a valid labels tensor:

tensor([[ 8, 12,  1, 12,  8,  8],
        [14, 11,  1,  8, 13,  0],
        [ 6,  9,  3,  6,  8, 11],
        [ 1, 11,  7,  9,  8,  5]])

Yeah, that tensor makes more sense to me too, but for some reason my labels tensor looks different.
I feel like this weird shape is somehow related to the fact that my mini_batch is 4, but I don't know why it doesn't have the shape you described. Do you have any idea why it is different?
Is it because my labels are inside a list and don't have any arrays inside, as opposed to the images?

Yeah, I think we should dig a bit deeper at this point.
Could you share your Dataset code? You don't need to post your actual data, random values will do.
I would like to debug your Dataset first and then we can have a look at the training loop.
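
For reference, a minimal sketch of what such a multi-label Dataset could look like (random images and genre indices; all names and sizes here are placeholders):

import torch
from torch.utils.data import Dataset, DataLoader

class RandomMultiLabelDataset(Dataset):
    """Hypothetical dataset: random images with 6 genre indices out of 15 classes per sample."""
    def __init__(self, num_samples=20, num_classes=15, labels_per_sample=6):
        self.images = torch.randn(num_samples, 3, 32, 32)
        self.label_indices = torch.randint(0, num_classes, (num_samples, labels_per_sample))
        self.num_classes = num_classes

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        # build the multi-hot target directly in __getitem__
        target = torch.zeros(self.num_classes)
        target[self.label_indices[index]] = 1.
        return image, target

loader = DataLoader(RandomMultiLabelDataset(), batch_size=4)
images, targets = next(iter(loader))
print(targets.shape)  # torch.Size([4, 15])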

I have a similar problem. In fact, I'm working on a similar project, except that I have 10 separate classes:
number_class = tensor([0,1,2,3,4,5,6,7,8,9]). My label is a 3D tensor with a mini_batch of 6: tensor([mini_batch, sequenz_time, feature]).
The output of my DNN is a 4D tensor: tensor([mini_batch, sequenz_time, feature, number_class]).
I used nn.CrossEntropyLoss after reshaping the output tensor from 4D to 2D (tensor([mini_batch, sequenz_time, feature, number_class]) → tensor([N, number_class])) and the label from 3D to 1D (tensor([mini_batch, sequenz_time, feature]) → tensor([N])). My problem is the following: when I apply argmax(dim=3) on the output tensor, I don't observe anything; my network doesn't learn. Can you please help me?
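
Here is a rough sketch of what I am doing, with made-up sizes and random values instead of my real data:

import torch
import torch.nn as nn

# made-up sizes just for illustration
mini_batch, sequenz_time, feature, number_class = 6, 5, 3, 10

output = torch.randn(mini_batch, sequenz_time, feature, number_class, requires_grad=True)
labels = torch.randint(0, number_class, (mini_batch, sequenz_time, feature))

criterion = nn.CrossEntropyLoss()
loss = criterion(output.reshape(-1, number_class),  # [N, number_class]
                 labels.reshape(-1))                # [N]
loss.backward()

preds = output.argmax(dim=3)  # discrete class predictions, same shape as labels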

Does y_pred need to be wrapped in nn.Softmax before sending it to this loss? For example:

# predict
x = data['image'].to(device)
y_true = data['label'].to(device).float()  # one-hot targets
y_pred = nn.Softmax(dim=1)(model(x).squeeze())

loss = nn.BCEWithLogitsLoss()(y_pred, y_true)

y_pred = y_pred.argmax(dim=1)
y_true = y_true.argmax(dim=1)
accuracy = (y_pred == y_true).float().sum() / len(y_pred)
loss_cohen = cohen_kappa_score(y_pred.cpu(), y_true.cpu(), weights='quadratic')

When you use nn.BCEWithLogitsLoss, it applies the sigmoid internally for you; you only need to add it manually if you are using nn.BCELoss.
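
A quick sketch to illustrate this (the two loss values should match up to floating-point error):

import torch
import torch.nn as nn

logits = torch.randn(4, 15)                      # raw model outputs (logits)
targets = torch.randint(0, 2, (4, 15)).float()   # multi-hot targets

loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)   # sigmoid applied internally
loss_manual = nn.BCELoss()(torch.sigmoid(logits), targets)   # sigmoid added manually

print(loss_with_logits.item(), loss_manual.item())  # nearly identical values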

Hi @ptrblck,
I'm getting the following error when I use scatter to create the multi-hot target:
RuntimeError: Expected object of device type cuda but got device type cpu for argument #1 ‘self’ in call to th_scatter
What could be the issue? I'm also doing multilabel classification.

It seems some input tensors are on the CPU, while the method expects them to be on the GPU.
Could you check the device of all input tensors and make sure you push them to the GPU before using scatter?
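
For example, a minimal sketch assuming the class indices come from the DataLoader on the CPU:

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

labels = torch.tensor([[3], [2], [7], [4]])               # class indices from the DataLoader (CPU)
labels = labels.to(device)                                 # move the indices to the target device
targets = torch.zeros(labels.size(0), 15, device=device)   # allocate the multi-hot target there too
targets.scatter_(1, labels, 1.)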

Okay will check that.

Warning, newbie question:

If I predict that no classes are present but the truth is that all classes are present, I get a loss of 0.69.
If I predict that no classes are present and the truth is that no classes are present, I get the same loss.
How does this BCEWithLogitsLoss work?

criterion = torch.nn.BCEWithLogitsLoss()
target_ones = torch.ones([10, 4], dtype=torch.float)    # batch of 10, 4 classes, all present
target_zeros = torch.zeros([10, 4], dtype=torch.float)  # batch of 10, 4 classes, none present
for i in range(10):
    y = torch.full([10, 4], i/10)  # batch of 10, prob i/10 that each class is present
    loss_vs_ones = criterion(y, target_ones)
    loss_vs_zeros = criterion(y, target_zeros)
    print(f"y= {i/10:.1f} → true are 1’s = {loss_vs_ones:.2f} | true are 0’s = {loss_vs_zeros:.2f}")

output is

y= 0.0 → true are 1’s = 0.69 | true are 0’s = 0.69
y= 0.1 → true are 1’s = 0.64 | true are 0’s = 0.74
y= 0.2 → true are 1’s = 0.60 | true are 0’s = 0.80
y= 0.3 → true are 1’s = 0.55 | true are 0’s = 0.85
y= 0.4 → true are 1’s = 0.51 | true are 0’s = 0.91
y= 0.5 → true are 1’s = 0.47 | true are 0’s = 0.97
y= 0.6 → true are 1’s = 0.44 | true are 0’s = 1.04
y= 0.7 → true are 1’s = 0.40 | true are 0’s = 1.10
y= 0.8 → true are 1’s = 0.37 | true are 0’s = 1.17
y= 0.9 → true are 1’s = 0.34 | true are 0’s = 1.24

nn.BCEWithLogitsLoss expects logits, not probabilities as its input.
An input value of 0.0 would represent a probability of 0.5, which thus yields -log(0.5) = 0.69.

If you want to use probabilities instead of logits, you could use nn.BCELoss instead.
Note that I would only recommend using it for this type of testing and debugging, as nn.BCEWithLogitsLoss will give you better numerical stability than sigmoid + nn.BCELoss.

Thanks,

So logit predictions are between -1 and 1?

criterion = torch.nn.BCEWithLogitsLoss()
target_ones = torch.ones([10, 4], dtype=torch.float)    # batch of 10, 4 classes, all present
target_zeros = torch.zeros([10, 4], dtype=torch.float)  # batch of 10, 4 classes, none present
for i in range(-10, 12, 2):
    y = torch.full([10, 4], i/10)  # batch of 10, logit i/10 for each class
    loss_vs_ones = criterion(y, target_ones)
    loss_vs_zeros = criterion(y, target_zeros)
    print(f"y= {i/10:.1f} → true are 1’s = {loss_vs_ones:.2f} | true are -1’s = {loss_vs_zeros:.2f}")

Now the output looks symmetrical

y= -1.0 → true are 1’s = 1.31 | true are -1’s = 0.31
y= -0.8 → true are 1’s = 1.17 | true are -1’s = 0.37
y= -0.6 → true are 1’s = 1.04 | true are -1’s = 0.44
y= -0.4 → true are 1’s = 0.91 | true are -1’s = 0.51
y= -0.2 → true are 1’s = 0.80 | true are -1’s = 0.60
y= 0.0 → true are 1’s = 0.69 | true are -1’s = 0.69
y= 0.2 → true are 1’s = 0.60 | true are -1’s = 0.80
y= 0.4 → true are 1’s = 0.51 | true are -1’s = 0.91
y= 0.6 → true are 1’s = 0.44 | true are -1’s = 1.04
y= 0.8 → true are 1’s = 0.37 | true are -1’s = 1.17
y= 1.0 → true are 1’s = 0.31 | true are -1’s = 1.31

BCELoss expects probabilities between 0 and 1

criterion = nn.BCELoss()
target_ones = torch.ones([10, 4], dtype=torch.float)    # batch of 10, 4 classes, all present
target_zeros = torch.zeros([10, 4], dtype=torch.float)  # batch of 10, 4 classes, none present
for i in range(11):
    y = torch.full([10, 4], i/10)  # batch of 10, prob i/10 that each class is present
    loss_vs_ones = criterion(y, target_ones)
    loss_vs_zeros = criterion(y, target_zeros)
    print(f"y= {i/10:.1f} → true are 1’s = {loss_vs_ones:.2f} | true are 0’s = {loss_vs_zeros:.2f}")

Now the output is also symmetrical.

y= 0.0 → true are 1’s = 100.00 | true are 0’s = 0.00
y= 0.1 → true are 1’s = 2.30 | true are 0’s = 0.11
y= 0.2 → true are 1’s = 1.61 | true are 0’s = 0.22
y= 0.3 → true are 1’s = 1.20 | true are 0’s = 0.36
y= 0.4 → true are 1’s = 0.92 | true are 0’s = 0.51
y= 0.5 → true are 1’s = 0.69 | true are 0’s = 0.69
y= 0.6 → true are 1’s = 0.51 | true are 0’s = 0.92
y= 0.7 → true are 1’s = 0.36 | true are 0’s = 1.20
y= 0.8 → true are 1’s = 0.22 | true are 0’s = 1.61
y= 0.9 → true are 1’s = 0.11 | true are 0’s = 2.30
y= 1.0 → true are 1’s = 0.00 | true are 0’s = 100.00

Thank you

Note that logits are unbounded and can take any value in [-Inf, Inf]. Of course an Inf value is not reasonable, but in general they can be very small or very large, not only in [-1, 1].
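
A tiny example of how the sigmoid maps arbitrarily large logits back into (0, 1):

import torch

logits = torch.tensor([-100., -1., 0., 1., 100.])
print(torch.sigmoid(logits))
# tensor([0.0000, 0.2689, 0.5000, 0.7311, 1.0000])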

Good to hear, the code yields the expected results now.