What kind of loss is better to use in multilabel classification?

Do you mean to do it outside the training loop?
Here is how I did it:

for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data
        labels = labels.unsqueeze(0)
        targets = torch.zeros(labels.size(0),15).scatter_(1, labels, 1.)
        targets = targets.squeeze(0)
        targets = targets.float()
        inputs, targets = inputs.to(device), targets.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        losses.append(loss.data.cpu().numpy())
        print("Epoch {} Loss: {:.4f}".format(epoch, np.asscalar(np.mean(losses))))

I did it as you suggested (with squeeze(0)), but if I do it this way, I get an incompatible-shapes error like this:

Target size (torch.Size([1, 15])) must be the same as input size (torch.Size([4, 15]))

My labels tensor has a shape of 4 before I send it to scatter_, and afterwards it becomes 1, while my inputs stay the same… How can I fix that?

By the way, you should amend your title and change multiclass (only one class output out of multiple choices) to multilabel (a variable number of outputs from multiple fixed choices).

I had good success in Kaggle competitions with MultiLabelSoftMarginLoss, which is sigmoid + binary cross-entropy (question: what's the difference with BCEWithLogitsLoss?).

You can check out this PyTorch tutorial kernel (note, it was for PyTorch 0.1) and the full-blown code for the competition.

The tricky part you are missing is at inference time: how do you convert the probabilities into discrete predicted labels?

The naive and classic way to do that is to consider every probability p > 0.5 as a true label and every p < 0.5 as false, and discretise the output that way.
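
For example, a minimal sketch of that thresholding, assuming outputs holds the raw logits from the model:

import torch

# hypothetical raw logits from the model, shape [batch_size, num_labels]
outputs = torch.randn(4, 15)

probs = torch.sigmoid(outputs)   # per-label probabilities in (0, 1)
preds = (probs > 0.5).float()    # multi-hot predictions: 1 = label predicted present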

For the Kaggle competition, I instead had a global optimiser search for the best threshold for each label. This was especially helpful to deal with label imbalance in the training set.
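
The optimiser I actually used is in the competition code linked above, but a rough sketch of a simple per-label grid search (with hypothetical val_probs / val_targets arrays from a validation set) could look like this:

import numpy as np
from sklearn.metrics import f1_score

# hypothetical validation probabilities and multi-hot targets, shape [num_samples, num_labels]
val_probs = np.random.rand(100, 15)
val_targets = (np.random.rand(100, 15) > 0.7).astype(int)

thresholds = np.full(val_probs.shape[1], 0.5)
for label in range(val_probs.shape[1]):
    best_t, best_f1 = 0.5, 0.0
    for t in np.linspace(0.05, 0.95, 19):  # coarse grid of candidate thresholds
        f1 = f1_score(val_targets[:, label], (val_probs[:, label] > t).astype(int))
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    thresholds[label] = best_t

preds = (val_probs > thresholds).astype(int)  # apply the per-label thresholds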

Lastly, while you are more or less forced to use BCE loss or variations of it for training, you might ultimately want to evaluate your model with a score that takes false negatives and false positives into account and penalises them according to your precision/recall priorities; examples are the F-beta scores (F1, F0.5, F2, …), AUC/ROC, and the Matthews correlation coefficient.
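
As a quick sketch, these scores can be computed with scikit-learn on hypothetical multi-hot preds / targets arrays:

import numpy as np
from sklearn.metrics import f1_score, fbeta_score, matthews_corrcoef

# hypothetical multi-hot predictions and ground truth, shape [num_samples, num_labels]
preds = (np.random.rand(100, 15) > 0.5).astype(int)
targets = (np.random.rand(100, 15) > 0.7).astype(int)

print(f1_score(targets, preds, average='macro'))             # F1 averaged over labels
print(fbeta_score(targets, preds, beta=2, average='macro'))  # F2 weights recall higher
print(matthews_corrcoef(targets.ravel(), preds.ravel()))     # MCC over the flattened labels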


Hi @mratsim, thanks for the correction and for your suggestions. In my case I am struggling with reshaping my labels so that they are compatible with the size of my inputs; the system gives me an error no matter what I do. I am using nn.BCEWithLogitsLoss with multi-hot encoding as @ptrblck suggested earlier, but I am running into a lot of issues, such as reshaping my labels after multi-hot encoding so that their size is compatible with the size of my inputs. I can't find the right solution, so I would appreciate any suggestion, since without it the accuracy metric I use just reports everything as false (you can see it in the example output I posted earlier in this topic). I also consider all probabilities p > 0.5 as true labels and all p < 0.5 as false, but all my probabilities turned out to be negative… I am really stuck at this point, so any help is much appreciated.

I thought your labels would have variable sizes, so that I would transform them in __getitem__ or even before it, but apparently you are able to feed a whole batch of labels.
In that case, your labels should already be two-dimensional, so that we don’t need the unsqueeze.
In my example code I was using your sample labels tensor, which only had one dimension.
Could you check the shape of labels just after getting it from data?

My labels did vary in size initially, but since they were a list of lists, I decided to flatten them to make them easier to work with when it comes to batches… Since the number of labels after flattening was not equal to the number of instances, I decided to cut the list like

y_train = y_train[0:len(x_train)]

so it would be easier for the DataLoader to split it into batches.

By data, do you mean after I have loaded it into the DataLoader? If so, then the shape is
torch.Size([4]) when the mini_batch = 4.

For a batch size of 4, your labels would thus only contain a single scalar for each sample in the batch.
Could you print one example of these labels?

Sure, here is an example:

for index, data in enumerate(trainloader, 0):  
    inputs, labels = data
    
    print(labels)
    print(labels.size())


tensor([3, 2, 7, 4])
torch.Size([4])

Thanks for the info!
I thought each sample should have a labels tensor with 6 entries for the genres?
Currently each sample has just one class index.

E.g. I thought this would be a valid labels tensor:

tensor([[ 8, 12,  1, 12,  8,  8],
        [14, 11,  1,  8, 13,  0],
        [ 6,  9,  3,  6,  8, 11],
        [ 1, 11,  7,  9,  8,  5]])

Yeah, that tensor makes more sense to me too, but for some reason my labels tensor looks different.
I feel like this weird shape is somehow related to the fact that my mini_batch is 4, but I don't know why it doesn't have the shape you described. Do you have any idea why it is different?
Is it because my labels are inside a list and don't have any arrays inside, as opposed to the images?

Yeah, I think we should dig a bit deeper at this point.
Could you share your Dataset code? You don't need to post your actual data, random values will do.
I would like to debug your Dataset first and then we can have a look at the training loop.
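
For reference, a minimal sketch of what such a multi-label Dataset could look like (random images and genre indices; all names and sizes here are placeholders):

import torch
from torch.utils.data import Dataset, DataLoader

class RandomMultiLabelDataset(Dataset):
    """Hypothetical dataset: random images with 6 genre indices out of 15 classes per sample."""
    def __init__(self, num_samples=20, num_classes=15, labels_per_sample=6):
        self.images = torch.randn(num_samples, 3, 32, 32)
        self.label_indices = torch.randint(0, num_classes, (num_samples, labels_per_sample))
        self.num_classes = num_classes

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        # build the multi-hot target directly in __getitem__
        target = torch.zeros(self.num_classes)
        target[self.label_indices[index]] = 1.
        return image, target

loader = DataLoader(RandomMultiLabelDataset(), batch_size=4)
images, targets = next(iter(loader))
print(targets.shape)  # torch.Size([4, 15])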

I have a similar problem. In fact, I'm working on a similar project, except that I have 10 separate classes:
number_class = tensor([0,1,2,3,4,5,6,7,8,9]). My label is a 3D tensor with a mini_batch of 6: tensor([mini_batch, sequenz_time, feature]).
The output of my DNN is a 4D tensor: tensor([mini_batch, sequenz_time, feature, number_class]).
I used nn.CrossEntropyLoss after reshaping the output tensor from 4D to 2D (tensor([mini_batch, sequenz_time, feature, number_class]) → tensor([N, number_class])) and the label from 3D to 1D (tensor([mini_batch, sequenz_time, feature]) → tensor([N])). My problem is the following: when I apply argmax(dim=3) on the output tensor, I don't observe anything; my network doesn't learn. Can you please help me?
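
Here is a rough sketch of what I am doing, with made-up sizes and random values instead of my real data:

import torch
import torch.nn as nn

# made-up sizes just for illustration
mini_batch, sequenz_time, feature, number_class = 6, 5, 3, 10

output = torch.randn(mini_batch, sequenz_time, feature, number_class, requires_grad=True)
labels = torch.randint(0, number_class, (mini_batch, sequenz_time, feature))

criterion = nn.CrossEntropyLoss()
loss = criterion(output.reshape(-1, number_class),  # [N, number_class]
                 labels.reshape(-1))                # [N]
loss.backward()

preds = output.argmax(dim=3)  # discrete class predictions, same shape as labels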

Does y_pred need to be wrapped in nn.Softmax before sending it to this loss? For example:

# predict
x = data['image'].to(device)
y_true = data['label'].to(device).float()  # one-hot targets
y_pred = nn.Softmax(dim=1)(model(x).squeeze())

loss = nn.BCEWithLogitsLoss()(y_pred, y_true)

y_pred = y_pred.argmax(dim=1)
y_true = y_true.argmax(dim=1)
accuracy = (y_pred == y_true).float().sum() / len(y_pred)
loss_cohen = cohen_kappa_score(y_pred.cpu(), y_true.cpu(), weights='quadratic')

When you use nn.BCEWithLogitsLoss, it applies the sigmoid internally for you; you only need to add it manually if you are using nn.BCELoss.
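
A quick sketch to illustrate this (the two loss values should match up to floating-point error):

import torch
import torch.nn as nn

logits = torch.randn(4, 15)                      # raw model outputs (logits)
targets = torch.randint(0, 2, (4, 15)).float()   # multi-hot targets

loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)   # sigmoid applied internally
loss_manual = nn.BCELoss()(torch.sigmoid(logits), targets)   # sigmoid added manually

print(loss_with_logits.item(), loss_manual.item())  # nearly identical values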

Hi @ptrblck,
I'm getting the following error when I use scatter to create the multi-hot target:
RuntimeError: Expected object of device type cuda but got device type cpu for argument #1 ‘self’ in call to th_scatter
What could be the issue? I'm also doing multilabel classification.

It seems some input tensors are on the CPU, while the method expects them to be on the GPU.
Could you check the device of all input tensors and make sure you push them to the GPU before using scatter?
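
For example, a minimal sketch assuming the class indices come from the DataLoader on the CPU:

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

labels = torch.tensor([[3], [2], [7], [4]])               # class indices from the DataLoader (CPU)
labels = labels.to(device)                                 # move the indices to the target device
targets = torch.zeros(labels.size(0), 15, device=device)   # allocate the multi-hot target there too
targets.scatter_(1, labels, 1.)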

Okay will check that.

Warning, newbie question:

If I predict that no classes are present but the truth is that all classes are present, I get a loss of 0.69.
If I predict that no classes are present and the truth is that no classes are present, I get the same loss.
How does this BCEWithLogitsLoss work?

criterion = torch.nn.BCEWithLogitsLoss()
target_ones = torch.ones([10, 4], dtype=torch.float)    # batch of 10, 4 classes, all present
target_zeros = torch.zeros([10, 4], dtype=torch.float)  # batch of 10, 4 classes, none present
for i in range(10):
    y = torch.full([10, 4], i/10)  # batch of 10, prob i/10 that each class is present
    loss_vs_ones = criterion(y, target_ones)
    loss_vs_zeros = criterion(y, target_zeros)
    print(f"y= {i/10:.1f} → true are 1’s = {loss_vs_ones:.2f} | true are 0’s = {loss_vs_zeros:.2f}")

output is

y= 0.0 → true are 1’s = 0.69 | true are 0’s = 0.69
y= 0.1 → true are 1’s = 0.64 | true are 0’s = 0.74
y= 0.2 → true are 1’s = 0.60 | true are 0’s = 0.80
y= 0.3 → true are 1’s = 0.55 | true are 0’s = 0.85
y= 0.4 → true are 1’s = 0.51 | true are 0’s = 0.91
y= 0.5 → true are 1’s = 0.47 | true are 0’s = 0.97
y= 0.6 → true are 1’s = 0.44 | true are 0’s = 1.04
y= 0.7 → true are 1’s = 0.40 | true are 0’s = 1.10
y= 0.8 → true are 1’s = 0.37 | true are 0’s = 1.17
y= 0.9 → true are 1’s = 0.34 | true are 0’s = 1.24

nn.BCEWithLogitsLoss expects logits, not probabilities as its input.
An input value of 0.0 would represent a probability of 0.5, which thus yields -log(0.5) = 0.69.

If you want to use probabilities instead of logits, you could use nn.BCELoss instead.
Note that I would only recommend using it for this type of testing and debugging, as nn.BCEWithLogitsLoss will give you better numerical stability than sigmoid + nn.BCELoss.

Thanks,

So logit predictions are between -1 and 1?

criterion = torch.nn.BCEWithLogitsLoss()
target_ones = torch.ones([10, 4], dtype=torch.float)    # batch of 10, 4 classes, all present
target_zeros = torch.zeros([10, 4], dtype=torch.float)  # batch of 10, 4 classes, none present
for i in range(-10, 12, 2):
    y = torch.full([10, 4], i/10)  # batch of 10, logit i/10 for each class
    loss_vs_ones = criterion(y, target_ones)
    loss_vs_zeros = criterion(y, target_zeros)
    print(f"y= {i/10:.1f} → true are 1’s = {loss_vs_ones:.2f} | true are -1’s = {loss_vs_zeros:.2f}")

Now the output looks symmetrical

y= -1.0 → true are 1’s = 1.31 | true are -1’s = 0.31
y= -0.8 → true are 1’s = 1.17 | true are -1’s = 0.37
y= -0.6 → true are 1’s = 1.04 | true are -1’s = 0.44
y= -0.4 → true are 1’s = 0.91 | true are -1’s = 0.51
y= -0.2 → true are 1’s = 0.80 | true are -1’s = 0.60
y= 0.0 → true are 1’s = 0.69 | true are -1’s = 0.69
y= 0.2 → true are 1’s = 0.60 | true are -1’s = 0.80
y= 0.4 → true are 1’s = 0.51 | true are -1’s = 0.91
y= 0.6 → true are 1’s = 0.44 | true are -1’s = 1.04
y= 0.8 → true are 1’s = 0.37 | true are -1’s = 1.17
y= 1.0 → true are 1’s = 0.31 | true are -1’s = 1.31

BCELoss expects probabilities between 0 and 1

criterion = nn.BCELoss()
target_ones = torch.ones([10, 4], dtype=torch.float)    # batch of 10, 4 classes, all present
target_zeros = torch.zeros([10, 4], dtype=torch.float)  # batch of 10, 4 classes, none present
for i in range(11):
    y = torch.full([10, 4], i/10)  # batch of 10, prob i/10 that each class is present
    loss_vs_ones = criterion(y, target_ones)
    loss_vs_zeros = criterion(y, target_zeros)
    print(f"y= {i/10:.1f} → true are 1’s = {loss_vs_ones:.2f} | true are 0’s = {loss_vs_zeros:.2f}")

Now the output is also symmetrical.

y= 0.0 → true are 1’s = 100.00 | true are 0’s = 0.00
y= 0.1 → true are 1’s = 2.30 | true are 0’s = 0.11
y= 0.2 → true are 1’s = 1.61 | true are 0’s = 0.22
y= 0.3 → true are 1’s = 1.20 | true are 0’s = 0.36
y= 0.4 → true are 1’s = 0.92 | true are 0’s = 0.51
y= 0.5 → true are 1’s = 0.69 | true are 0’s = 0.69
y= 0.6 → true are 1’s = 0.51 | true are 0’s = 0.92
y= 0.7 → true are 1’s = 0.36 | true are 0’s = 1.20
y= 0.8 → true are 1’s = 0.22 | true are 0’s = 1.61
y= 0.9 → true are 1’s = 0.11 | true are 0’s = 2.30
y= 1.0 → true are 1’s = 0.00 | true are 0’s = 100.00

Thank you

Note that logits are unbounded and can take any value in [-Inf, Inf]. Of course an Inf value is not reasonable, but in general they can be very small or very large, not only in [-1, 1].
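
A tiny example of how the sigmoid maps arbitrarily large logits back into (0, 1):

import torch

logits = torch.tensor([-100., -1., 0., 1., 100.])
print(torch.sigmoid(logits))
# tensor([0.0000, 0.2689, 0.5000, 0.7311, 1.0000])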

Good to hear, the code yields the expected results now.