Calculating Precision, Recall and F1 score in case of multi label classification

I have a tensor of ground-truth labels that are one-hot encoded, and a prediction tensor that holds the probability for each class. How can I calculate precision, recall and F1 score for multi-label classification in PyTorch?

Precision, recall and F1 score are defined for a binary classification task.
Usually you would have to treat your data as a collection of multiple binary problems to calculate these metrics.

The multi label metric will be calculated using an average strategy, e.g. macro/micro averaging.
You could use the scikit-learn metrics to calculate these metrics.
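
As a rough illustration of the "multiple binary problems" view, the sketch below scores each class column separately and then compares macro and micro averaging. The arrays are made-up toy values, not data from this thread, and the variable names are only for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label data (hypothetical): 4 samples, 3 classes.
y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

# average=None returns one score per class, i.e. one binary problem per column.
per_class_f1 = f1_score(y_true, y_pred, average=None)

# Macro: unweighted mean of the per-class scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Micro: pool true positives, false positives and false negatives
# over all classes, then compute a single score.
micro_f1 = f1_score(y_true, y_pred, average="micro")

print("per class:", per_class_f1)
print("macro:", macro_f1)
print("micro:", micro_f1)
```

Macro averaging gives every class the same weight regardless of how often it occurs, while micro averaging is dominated by the frequent classes, so the two can differ noticeably on imbalanced label sets.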

I am using scikit learn metrics for this and used this code:

print('F1: {}'.format(f1_score(outGT, outPRED, average="samples")))
print('Precision: {}'.format(precision_score(outGT, outPRED, average="samples")))
print('Recall: {}'.format(recall_score(outGT, outPRED, average="samples")))

This is throwing this error:

ValueError: Classification metrics can't handle a mix of multilabel-indicator and continuous-multioutput targets

The output of

print('Ground Truth: {}'.format(outGT))
print('Predicted Truth: {}'.format(outPRED))

is as below:

Ground Truth: 
    0     0     0  ...      0     0     0
    0     0     0  ...      0     0     0
    0     0     0  ...      0     0     0
       ...          ⋱          ...       
    1     0     0  ...      0     0     0
    1     0     0  ...      0     0     0
    0     0     0  ...      0     0     0
[torch.cuda.FloatTensor of size 22433x14 (GPU 0)]

Predicted Truth: 
 0.0901  0.0916  0.0389  ...   0.0021  0.0078  0.0016
 0.0424  0.0084  0.0111  ...   0.0053  0.0079  0.0025
 0.0611  0.0205  0.0206  ...   0.0024  0.0074  0.0018
          ...             ⋱             ...          
 0.3588  0.0223  0.1421  ...   0.0036  0.0094  0.0035
 0.1782  0.0226  0.2275  ...   0.0033  0.0129  0.0016
 0.2574  0.0176  0.2255  ...   0.0034  0.0118  0.0023
[torch.cuda.FloatTensor of size 22433x14 (GPU 0)]

Try to use a threshold on your predictions, so that they indicate a predicted label.
This should work:

f1_score(outGT, outPRED > 0.5, average="samples")

EDIT: Also, push the tensors to the CPU first (e.g. with .cpu()), since scikit-learn works on NumPy arrays and cannot read CUDA tensors directly.
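
Putting the threshold and the CPU transfer together, a minimal sketch could look like the following. The tensors `out_gt` and `out_pred` are small hypothetical stand-ins for `outGT` and `outPRED`, the 0.5 threshold is just the value suggested above, and `zero_division=0` assumes a scikit-learn version that supports that argument (0.22 or later).

```python
import torch
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical stand-ins for outGT / outPRED: 3 samples, 3 classes.
out_gt = torch.tensor([[1., 0., 0.],
                       [0., 1., 1.],
                       [1., 1., 0.]])
out_pred = torch.tensor([[0.9, 0.2, 0.1],
                         [0.3, 0.8, 0.4],
                         [0.7, 0.6, 0.2]])

threshold = 0.5  # assumed cut-off; tune it on a validation set if needed

# Move to CPU, convert to NumPy, and binarize the probabilities so both
# arrays are multilabel indicator matrices of 0s and 1s.
y_true = out_gt.detach().cpu().numpy().astype(int)
y_pred = (out_pred.detach().cpu() > threshold).numpy().astype(int)

# zero_division=0 scores samples with no true or predicted labels as 0
# instead of emitting a warning.
f1 = f1_score(y_true, y_pred, average="samples", zero_division=0)
precision = precision_score(y_true, y_pred, average="samples", zero_division=0)
recall = recall_score(y_true, y_pred, average="samples", zero_division=0)

print("F1: {:.4f}".format(f1))
print("Precision: {:.4f}".format(precision))
print("Recall: {:.4f}".format(recall))
```

With `average="samples"` the metric is computed per row and then averaged, so rows without any positive label (as in the ground truth printed above) pull the score toward whatever `zero_division` is set to.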

Thanks a lot. The solution makes a lot of sense. I have to threshold the predictions before trying to calculate the score.

@ptrblck
I am also working on a multi-label classification task where the ground-truth labels are one-hot encoded. I get predictions for each sample and the loss is computed properly, but when I try to compute the score as you suggested in this post, I still get the error "ValueError: Classification metrics can't handle a mix of unknown and multilabel-indicator targets".

print('F1: {}'.format(f1_score(labels.data.to('cpu'), outputs.data.to('cpu') > 0.5, average="samples")))
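
The thread does not say what caused this second error. One way to narrow it down is to ask scikit-learn how it classifies each array, using `type_of_target` from `sklearn.utils.multiclass`. The sketch below uses hypothetical NumPy arrays rather than the poster's data; an extra dimension is only one example of an input that gets reported as "unknown", not a diagnosis of the case above.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical arrays standing in for labels / outputs after .cpu().numpy().
labels_2d = np.array([[1, 0, 0],
                      [0, 1, 1]])        # binary indicator matrix
probs_2d = np.array([[0.9, 0.2, 0.1],
                     [0.3, 0.8, 0.4]])   # raw probabilities, not thresholded
labels_3d = labels_2d[:, None, :]        # same values with a stray extra dimension

# The metric functions require both arguments to have a compatible target type.
for name, arr in [("labels_2d", labels_2d),
                  ("probs_2d", probs_2d),
                  ("labels_3d", labels_3d)]:
    print(name, arr.shape, arr.dtype, "->", type_of_target(arr))
```

Printing the shape and dtype alongside the target type usually shows which of the two arrays needs reshaping, thresholding, or a dtype conversion before it is passed to `f1_score`.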

I wrote a function in PyTorch in an attempt to train with an F1 loss: https://gist.github.com/SuperShinyEyes/dcc68a08ff8b615442e3bc6a9b55a354

Usually this has worked for me:

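import torch
from sklearn.metrics import precision_score

def precision(outputs, labels):
    # Move both tensors to the CPU so scikit-learn can read them.
    op = outputs.cpu()
    la = labels.cpu()
    # Pick the highest-scoring class per sample; labels must be class indices.
    _, preds = torch.max(op, dim=1)
    return torch.tensor(precision_score(la, preds, average='weighted'))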