Calculating train f1 and acc on a multi-label classification model

Hi community!

I am having some trouble calculating the F1 score and accuracy for a multi-label classification model. Here is my code:

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

class MultiClassifier(nn.Module):
    def __init__(self):
        super(MultiClassifier, self).__init__()
        # Create sequential containers. Modules are added in the order they are passed to the constructor
        self.ConvLayer1 = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1), # 3,128,128
            nn.MaxPool2d(2,2), # op: 64, 64, 64
            nn.ReLU(), # op: 64, 64, 64
        )
        self.ConvLayer2 = nn.Sequential(
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1, padding=1), # 64, 64, 64
            nn.MaxPool2d(2,2), #op: 128, 32, 32
            nn.ReLU(), # op: 128, 32, 32
        )
        self.ConvLayer3 = nn.Sequential(
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, stride=1, padding=1), # 128,32,32
            nn.MaxPool2d(2,2), #op: 256, 16, 16
            nn.ReLU(), #op: 256, 16, 16
        )
        self.Linear1 = nn.Linear(256 * 16 * 16, 512)
        self.Linear2 = nn.Linear(512, 256)
        self.Linear3 = nn.Linear(256, 18)
    def forward(self, x):
        x = self.ConvLayer1(x)
        x = self.ConvLayer2(x)
        #x = self.ConvLayer3(x)
        x = self.ConvLayer3(x)
        x = x.view(x.size(0), -1)#view is used to change the shape of the tensor, here we flatten the convolutional layer to 1 dimension for linear layers
        x = self.Linear1(x)
        x = self.Linear2(x)
        x = self.Linear3(x)
        return torch.sigmoid(x)


from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm
for epoch in range(total_epoch):
    train_loss = 0
    loop = tqdm(enumerate(train_data), total = len(train_data), leave = False)
    for i, data in loop:
        inputs, target = data['image'].float().to(device),  data['label'].float().to(device)
        outputs = model(inputs).to(device)
        loss = criterion(outputs, target).to(device)
        #update progress bar
        loop.set_description(f"Epoch [{epoch}/{total_epoch}]")
        loop.set_postfix(loss = loss.item())

    train_loss += loss.item()
    y_true = target.cpu()
    y_pred = outputs.cpu().int().detach().numpy()
    train_f1 = f1_score(y_true, y_pred, average='micro')
    train_acc = accuracy_score(y_true, y_pred)

    print('Epoch: %d, loss: %.5f, train_acc: %.2f, train_f1: %.2f'%(epoch , train_loss, train_acc, train_f1))

I think it has something to do with my output, since it is a multi-hot vector over 18 classes. Any advice on what I should do? Thanks!

I see some issues (or are they intentional?) in the code:

  1. The metrics (accuracy and F1) and the loss are calculated after the loop has completed, so it seems only the last iteration is used for the calculation.
  2. Although the outputs are converted to int, the cast truncates toward zero, so unless a prediction is exactly equal to 1 it almost always becomes 0.
  3. Accuracy score may not be the right choice for multi-label classification, as it requires all the labels (all 18) of a sample to match the predictions. Hamming loss, which gives the fraction of labels that are incorrectly predicted, may be a better choice. But this is a choice you would have to make yourself.
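
To make the three points concrete, here is a minimal NumPy sketch. The probabilities and targets are made-up toy data with 4 labels instead of 18, and the `batches` list is only a stand-in for a real data loader. It illustrates truncation versus thresholding, accumulating over all batches, and subset accuracy versus Hamming loss; it is not meant as the exact fix for the code above.

```python
import numpy as np

# Toy stand-ins for two batches of sigmoid outputs and multi-hot targets
# (hypothetical data, 4 labels instead of 18).
batches = [
    (np.array([[0.9, 0.2, 0.7, 0.1],
               [0.4, 0.8, 0.6, 0.3]]),
     np.array([[1, 0, 1, 0],
               [0, 1, 0, 0]])),
    (np.array([[0.1, 0.95, 0.3, 0.85]]),
     np.array([[0, 1, 0, 1]])),
]

# Point 2: casting to int truncates, so every probability below 1.0 becomes 0.
truncated = batches[0][0].astype(int)

all_pred, all_true = [], []
for probs, target in batches:
    # Threshold at 0.5 instead of truncating.
    all_pred.append((probs >= 0.5).astype(int))
    all_true.append(target)

# Point 1: compute the metrics over every batch, not just the last one.
y_pred = np.concatenate(all_pred, axis=0)
y_true = np.concatenate(all_true, axis=0)

# Point 3: subset accuracy needs all labels of a sample to match,
# while Hamming loss is the fraction of individual labels that are wrong.
subset_accuracy = np.mean(np.all(y_pred == y_true, axis=1))
hamming = np.mean(y_pred != y_true)

print(truncated)
print(subset_accuracy, hamming)
```

With this toy data a single wrong label in one sample lowers the subset accuracy by a whole sample, while the Hamming loss only counts that one label.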

Hi @user_123454321, thanks for your reply. I am actually new to PyTorch and also new to multi-label classification. For 1., are you suggesting that I put the loss function within the for loop over the dataloader?
And for 2 and 3, can you point me somewhere or give me an example of what I should do? Thanks!

If it's multi-label classification, there are specific modules you need to use. But since you are new, I prefer to give you directly an overview of what good code for your problem could look like (you will have to google the modules I import for further understanding). This code is just an illustration.
I have incorporated @user_123454321's remarks here.

import torch
#from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer

num_classes = 18 # I hope
class_names = list(range(num_classes)) 
label_binarizer = MultiLabelBinarizer(classes=class_names)

target_list = []
outputs_list = []
loss_list = []

for i, data in loop:
    inputs, target = data['image'].float().to(device),  data['label'].float().to(device)
    # your code
    outputs = model(inputs).to(device)
    #your code
    # ...

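    y_pred = outputs.round().int() # thresholds at 0.5, i.e. (outputs > 0.5).int()

    # accumulate every batch so that the metrics cover the whole epoch
    loss_list.append(loss.item()) # loss is computed in "your code" above
    outputs_list.append(y_pred.detach().cpu())
    target_list.append(target.cpu().int())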

mean_loss = sum(loss_list)/len(loss_list)
y_pred = torch.cat(outputs_list, dim=0).numpy() # shape : n_samples * n_classes, [[1, 0, ..., ], [....]]
# y_pred is already multi-hot after thresholding, so it does not go through the binarizer.

# Here it's up to you to make sure the target is in the right format : [[1, 0, ..., ], [....]]
y_true = torch.cat(target_list, dim=0).numpy()
# Use label_binarizer.fit_transform(...) only if your raw labels are lists of class indices, e.g. [[0, 3], [7]].

You can use your sklearn metrics now, but I recommend the pytorch_lightning ones:

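# Note: in newer releases these metrics live in the separate torchmetrics package,
# and pytorch_lightning.metrics is no longer available.
from pytorch_lightning.metrics import IoU, AUROC, F1, Accuracy, AveragePrecision
from pytorch_lightning.metrics import HammingDistance # problem for some versions : cannot import name 'HammingDistance' from 'pytorch_lightning.metrics'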

val_metrics = {
    'hamming_dist': HammingDistance(), 
    'iou': IoU(num_classes=num_classes),
    'auroc': AUROC(num_classes=num_classes), 
    'f1': F1(num_classes=num_classes, multilabel=True), 
    'avg_precision': AveragePrecision(num_classes=num_classes),
    'accuracy' : Accuracy(top_k=1),
}

result = {}
y_pred = torch.from_numpy(y_pred)
y_true = torch.from_numpy(y_true)
for key in val_metrics:
    result[key] = val_metrics[key](y_pred, y_true)

@pascal_notsawo thank you so much for your reply! I will definitely look into it! I already have the MultiLabelBinarizer implemented to turn the labels into multi-hot vectors, but I just don't know how we are supposed to calculate the F1 score and the accuracy in this case.

@youwillknovv I see. sklearn supports this. Take a look at the two examples given here: one for multi-class and the other for multi-label (without changing the metric used)
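
As a rough illustration of that point (this is not the linked example, just a sketch with made-up toy data and 4 labels), the same `f1_score` call accepts either a list of class indices for multi-class or a multi-hot matrix for multi-label. `accuracy_score` on multi-hot input is the strict subset accuracy, and `hamming_loss` is the per-label alternative mentioned earlier in the thread.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Multi-class: one class index per sample.
y_true_mc = [0, 1, 2, 2]
y_pred_mc = [0, 2, 2, 2]
f1_mc = f1_score(y_true_mc, y_pred_mc, average='micro')

# Multi-label: one multi-hot row per sample (toy data, 4 labels).
y_true_ml = np.array([[1, 0, 1, 0],
                      [0, 1, 0, 0],
                      [0, 1, 0, 1]])
y_pred_ml = np.array([[1, 0, 1, 0],
                      [0, 1, 1, 0],
                      [0, 1, 0, 1]])
f1_ml = f1_score(y_true_ml, y_pred_ml, average='micro')
acc_ml = accuracy_score(y_true_ml, y_pred_ml)  # subset (exact-match) accuracy
ham_ml = hamming_loss(y_true_ml, y_pred_ml)    # fraction of wrong labels

print(f1_mc, f1_ml, acc_ml, ham_ml)
```

In a training loop, `y_true_ml` and `y_pred_ml` would be the targets and thresholded predictions collected over the whole epoch.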