Multi Label Classification in PyTorch

I was wondering if it is a common practice to just make the encoding diverge between labels in a multilabel problem in the final layer. Is there any comparison between different settings? I am guessing multitask binary classification and multilabel aren’t that different, so the place to begin diverging the encoder is not trivial.

Very good example, but I'm a little confused, being new to working with Torch. I have extended your sample example. The problem is: how do you scale this to, say, five labels? Would that need 5! loss criteria?

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(20, 5) # predict logits for 5 classes

x = torch.randn(1, 20)
y = torch.tensor([[1., 0., 1., 0., 0.]]) # get classA and classC as active
y1 = torch.tensor([[0., 1., 0., 1., 0.]]) #get classB and classD as active

print('x, y: ', x, y, y1)

criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-1)

for epoch in range(20):
    optimizer.zero_grad()
    output = model(x)
    print('output: ', output)
    loss1 = criterion(output, y)
    loss2 = criterion(output, y1)
    loss = loss1 + loss2
    loss.backward()
    optimizer.step()
    print('Loss: {:.3f}'.format(loss.item()))

I’m unsure what exactly your use case is as it seems you have two different targets for the same input and output of your model.
Could you describe your use case a bit more and why different targets are used?

My use case is based on an RL (reinforcement learning) model. It has 3 observations and 6 actions (actuators on/off), and any combination can be active, e.g. [1., 1., 0., 0., 1., 1.]. Any pointers would be helpful.
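For the setup described here (3 observations, 6 on/off actuators), a minimal sketch could look like the following. It assumes that a supervised multi-hot target per observation is available, which may not hold in an actual RL loop; the random data, batch size, and variable names are made up for illustration. The point is that a single nn.BCEWithLogitsLoss handles all labels at once, so the number of criteria does not grow with the number of labels or their combinations.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

num_obs, num_actions = 3, 6          # sizes taken from the post above
model = nn.Linear(num_obs, num_actions)  # one logit per actuator

# Hypothetical data: 8 observations, each with its own multi-hot target.
x = torch.randn(8, num_obs)
y = torch.randint(0, 2, (8, num_actions)).float()

# One criterion covers every label; each logit is treated as a binary problem.
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-1)

losses = []
for epoch in range(20):
    optimizer.zero_grad()
    logits = model(x)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# At inference time, threshold the per-actuator probabilities.
with torch.no_grad():
    active = (torch.sigmoid(model(x)) >= 0.5).int()

print('first / last loss:', losses[0], losses[-1])
print('predicted actuators for sample 0:', active[0])
```

Each row of y is one sample's target, so different targets belong to different samples rather than to separate loss terms on the same output.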

Hey @ptrblck thank you very much for providing all the useful information about Multi Label classification. I read through this timeline and didn’t find a post which discusses the following.

I want to train a model on three datasets: standard ImageNet (IN), ImageNet-a (IN-a) and Stylized-ImageNet (SIN). SIN maps the 1000 IN classes to only 16 classes, e.g. ‘airplane’ stands for IN label 404 and ‘bear’ stands for the IN labels 294, 295, 296, 297. To train a model on this dataset, I one-hot encoded the 16 SIN classes. For testing, I did the same, but I left out a SIN dataset with its original class labels (0-15, as imported through ImageFolder).

The authors of SIN gave a simple way of establishing a (single-label) testing function, as shown on their GitHub repository.
As another test approach for this Multi Labeling task, I implemented the following function for testing:

def multi_l_eval (model, test_loader):
  model.eval()
  correct = 0
  total = 0
  sensitivity = 0.5
  with torch.no_grad():
      for images, labels in test_loader:
          images, labels = images.to(device), labels.to(device)
          outputs = model(images)
          outputs = torch.sigmoid(outputs)

          outputs[outputs >= sensitivity] = 1
          outputs[outputs < sensitivity] = 0

          correct += (outputs == labels).sum()

          total += labels.size(0)*labels.size(1)
  return (correct/total)*100

ImageNet-a

Additionally, I want to train the model on IN-a (the IN adversarial example dataset) and SIN, which I did by also one-hot encoding the IN-a dataset. Training the model on the IN-a dataset (multilabel) yields a much higher test accuracy with my function than with the (single-label) testing approach shown in the IN-a git repo.

Question

When I compare the results of the single-label testing approaches with my function, my function outputs completely different, much higher accuracy percentages. Is my implemented function a correct way of analyzing the multi-label performance?
If you need more details, please ask!

Thank you for reading and taking your time :)

Training method

def std_train_model (model,train_loader, opt, num_epochs):
  model.train()  # Setting the model to training mode
  crit = nn.BCEWithLogitsLoss()
  for epoch in range(num_epochs):
      running_loss = 0.0
      for inputs, labels in train_loader:
          inputs, labels = inputs.to(device), labels.to(device)
          opt.zero_grad()
          outputs = model(inputs)
          loss = crit(outputs, labels)
          loss.backward()
          opt.step()

          running_loss += loss.item()

      print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {running_loss / len(train_loader)}')

  print("Finished Training")

Edit

I found some information saying that the classes need to be balanced in this case. How would I proceed to balance the SIN class labels with respect to the IN classes?

I don’t fully understand, as you describe working on a multi-label classification but also use samples with a single label only:

If each sample belongs to a single class only you would deal with a multi-class classification.
Calculating the accuracy by comparing each logit against its target instead of each sample against the class will add a bias to your accuracy and overestimate it.

E.g. take a look at this simple example where the model outputs the highest logit for class1 but the target is class0. In a multi-class classification the accuracy would be 0 since the prediction is wrong.
If you compare each logit now against the target and claim it’s a multi-class classification, the accuracy could be (nb_classes-1)/nb_classes assuming your model predicted all other classes inactive.
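A minimal sketch of the scenario described above, with made-up logits and nb_classes = 5 chosen purely for illustration: class1 has the highest logit, the target is class0, and every logit stays below the 0.5 probability threshold, so all classes are predicted inactive.

```python
import torch
import torch.nn.functional as F

nb_classes = 5
target_idx = torch.tensor([0])                            # true class is class0
one_hot_target = F.one_hot(target_idx, nb_classes).float()

# class1 has the highest logit, but all logits are negative (sigmoid < 0.5).
logits = torch.tensor([[-2.0, -0.5, -2.0, -2.0, -2.0]])

# Multi-class accuracy: compare the predicted class index with the target index.
multiclass_acc = (torch.argmax(logits, dim=1) == target_idx).float().mean()

# Element-wise accuracy: threshold every logit and compare it with its target entry.
preds = (torch.sigmoid(logits) >= 0.5).float()
elementwise_acc = (preds == one_hot_target).float().mean()

print('multi-class accuracy: ', multiclass_acc.item())
print('element-wise accuracy:', elementwise_acc.item())
```

The sample is misclassified, so the multi-class accuracy is 0, while the element-wise comparison still counts the four inactive entries that match the target and reports roughly (nb_classes-1)/nb_classes = 0.8.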

I included the already well-understood single-label training as a comparison to my results in multi-label classification. Furthermore, I want to train the model on multiple datasets, including single-label (IN-a) and multi-label (SIN) samples.

Okay, I get my mistake now. How could I test and train on the SIN dataset? How can I compute the predicted class in this multi-label setting then? As I have a variable number of correct labels, taking the top 15 highest probabilities, for example, would not be an option, right?

Okay maybe my follow up question was too unclear. How would you suggest changing the evaluation function? What exactly do you mean by “[compare] each sample against the class”? How can I realize this?

If each sample belongs to one class only, you could directly calculate the multi-class accuracy by comparing the predicted class index (e.g. via torch.argmax(output, dim=1)) against the class label (either directly or, if it’s one-hot encoded, by transforming it to the index via torch.argmax(one_hot_target, dim=1)). However, if you are really dealing with a multi-label classification, you could check utility functions such as torchmetrics.classification.MultilabelAccuracy and check how the average is computed (e.g. micro vs. macro).
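As a rough sketch of both cases, using only plain torch and made-up tensors (the values and variable names are hypothetical, and the micro/macro averages are computed by hand here rather than through torchmetrics):

```python
import torch

# Multi-class case: one class per sample, compare indices.
logits = torch.tensor([[2.0, 0.1, -1.0],
                       [0.3, 1.5, 0.2],
                       [0.1, 0.2, 3.0],
                       [1.0, 0.9, 0.8]])
one_hot_target = torch.tensor([[1., 0., 0.],
                               [0., 1., 0.],
                               [0., 0., 1.],
                               [0., 1., 0.]])
pred_idx = torch.argmax(logits, dim=1)
target_idx = torch.argmax(one_hot_target, dim=1)
multiclass_acc = (pred_idx == target_idx).float().mean()

# Multi-label case: any number of active labels per sample.
ml_logits = torch.tensor([[ 2.0, -1.0, -3.0],
                          [ 1.0,  2.0, -2.0],
                          [-1.0, -2.0, -1.0],
                          [ 3.0,  1.0, -4.0]])
ml_target = torch.tensor([[1., 0., 0.],
                          [1., 1., 0.],
                          [0., 1., 0.],
                          [1., 0., 1.]])
ml_pred = (torch.sigmoid(ml_logits) > 0.5).float()
hits = (ml_pred == ml_target).float()

micro_acc = hits.mean()              # pool all (sample, label) entries
per_label_acc = hits.mean(dim=0)     # accuracy of each label separately
macro_acc = per_label_acc.mean()     # unweighted mean over labels

print('multi-class accuracy:', multiclass_acc.item())
print('per-label accuracy:  ', per_label_acc.tolist())
print('micro / macro:       ', micro_acc.item(), macro_acc.item())
```

In this hand-rolled version, micro and macro accuracy coincide because every label contributes the same number of entries; the two averages only separate for metrics whose per-label denominators differ, such as precision or recall.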

Yes, in my view it is a multi-label classification, as the 1000 IN labels are mapped to 16 classes and therefore each of the 16 classes has one or more correct IN labels.

My function would now look like this:

def multi_l_eval (model, test_loader, threshold=0.5, beta=2.0):
  # Assumes torch, torchmetrics and a global `device` are available, as in the snippets above
  model.eval()
  acc_multi_metr = torchmetrics.classification.MultilabelAccuracy(num_labels=1000, threshold=threshold, average='macro')
  acc_multi_metr.to(device)
  acc_multi_l = 0
  f_beta_sum = 0
  with torch.no_grad():
      for i, (images, labels) in enumerate(test_loader):
          images, labels = images.to(device), labels.to(device)
          outputs = model(images)

          # Accuracy
          acc_multi_l += acc_multi_metr(outputs, labels)

          # F-beta score: beta=2 gives the F2 score (beta=1 would be F1)
          # The model returns logits, so apply the sigmoid before thresholding
          preds = torch.sigmoid(outputs) > threshold
          targets = labels > threshold
          TP = (preds & targets).sum(1).float()
          FP = (preds & (~targets)).sum(1).float()
          FN = ((~preds) & targets).sum(1).float()
          precision = torch.mean(TP / (TP + FP + 1e-12))
          recall = torch.mean(TP / (TP + FN + 1e-12))
          # Accumulate per batch so the result is not just the last batch's score
          f_beta_sum += (1 + beta**2) * precision * recall / (beta**2 * precision + recall + 1e-12)

  return f_beta_sum/(i+1), (acc_multi_l/(i+1))

average='macro' and average='micro' yield nearly the same value, which is also very close to the value of the function I used at the start. Did I use MultilabelAccuracy incorrectly?
Also, I read that using the F2 score as a metric makes more sense in a multi-label setting. Is this correct?