I have a tensor containing the ground truth labels, which are one-hot encoded. My predicted tensor has the probabilities for each class. In this case, how can I calculate the precision, recall and F1 score for multi-label classification in PyTorch?
Precision, recall and F1 score are defined for a binary classification task.
Usually you would have to treat your data as a collection of multiple binary problems to calculate these metrics.
The multi-label metric will then be calculated using an averaging strategy, e.g. macro or micro averaging.
You could use the scikit-learn metrics to calculate these metrics.
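For example, a minimal sketch (the toy labels are made up for illustration) showing how the averaging strategy changes the score:

import numpy as np
from sklearn.metrics import f1_score

# Toy multilabel data: 4 samples, 3 labels
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

# micro: pool TP/FP/FN over all labels, then compute F1 once
print(f1_score(y_true, y_pred, average='micro'))
# macro: compute F1 per label, then take the unweighted mean
print(f1_score(y_true, y_pred, average='macro'))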
I am using the scikit-learn metrics for this and used this code:
print('F1: {}'.format(f1_score(outGT, outPRED, average="samples")))
print('Precision: {}'.format(precision_score(outGT, outPRED, average="samples")))
print('Recall: {}'.format(recall_score(outGT, outPRED, average="samples")))
This is throwing this error:
ValueError: Classification metrics can't handle a mix of multilabel-indicator and continuous-multioutput targets
The output of
print('Ground Truth: {}'.format(outGT))
print('Predicted Truth: {}'.format(outPRED))
is as below:
Ground Truth:
0 0 0 ... 0 0 0
0 0 0 ... 0 0 0
0 0 0 ... 0 0 0
... ⋱ ...
1 0 0 ... 0 0 0
1 0 0 ... 0 0 0
0 0 0 ... 0 0 0
[torch.cuda.FloatTensor of size 22433x14 (GPU 0)]
Predicted Truth:
0.0901 0.0916 0.0389 ... 0.0021 0.0078 0.0016
0.0424 0.0084 0.0111 ... 0.0053 0.0079 0.0025
0.0611 0.0205 0.0206 ... 0.0024 0.0074 0.0018
... ⋱ ...
0.3588 0.0223 0.1421 ... 0.0036 0.0094 0.0035
0.1782 0.0226 0.2275 ... 0.0033 0.0129 0.0016
0.2574 0.0176 0.2255 ... 0.0034 0.0118 0.0023
[torch.cuda.FloatTensor of size 22433x14 (GPU 0)]
Try to use a threshold on your predictions so that they indicate a predicted label.
This should work:
f1_score(outGT, outPRED > 0.5, average="samples")
EDIT: Also, you might want to push the tensors to CPU first.
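Putting both together, a sketch using the tensors from your post (outGT and outPRED are the CUDA tensors shown above):

from sklearn.metrics import f1_score, precision_score, recall_score

# Threshold the probabilities to get hard 0/1 label predictions,
# then move everything to the CPU for the numpy-based sklearn metrics
gt = outGT.cpu().numpy()
pred = (outPRED > 0.5).cpu().numpy()

print('F1: {}'.format(f1_score(gt, pred, average='samples')))
print('Precision: {}'.format(precision_score(gt, pred, average='samples')))
print('Recall: {}'.format(recall_score(gt, pred, average='samples')))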
Thanks a lot. The solution makes a lot of sense. I have to apply a threshold before trying to calculate the score.
@ptrblck
I am also working on a multi-label classification task where the ground truth labels are one-hot encoded. I get the predicted values for the samples and the loss computes properly. But when I try to compute the accuracy as you suggested in the post, I still get the error "ValueError: Classification metrics can't handle a mix of unknown and multilabel-indicator targets".
print('F1: {}'.format(f1_score(labels.data.to('cpu'), outputs.data.to('cpu') > 0.5, average="samples")))
I wrote the function in PyTorch in an attempt to train with F1 loss. https://gist.github.com/SuperShinyEyes/dcc68a08ff8b615442e3bc6a9b55a354
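The core idea is a soft F1 computed from probabilities so it stays differentiable; a minimal sketch of that idea (not the gist verbatim):

import torch

def f1_loss(y_pred, y_true, eps=1e-7):
    # y_pred: probabilities in [0, 1], y_true: 0/1 targets, both of shape (N, C)
    # Soft counts keep the expression differentiable
    tp = (y_pred * y_true).sum(dim=0)
    fp = (y_pred * (1 - y_true)).sum(dim=0)
    fn = ((1 - y_pred) * y_true).sum(dim=0)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    # Macro-average over labels; return 1 - F1 so minimizing improves F1
    return 1 - f1.mean()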
Usually this has worked for me:
from sklearn.metrics import precision_score
import torch

def precision(outputs, labels):
    # Move both tensors to the CPU so sklearn can convert them to numpy
    op = outputs.cpu()
    la = labels.cpu()
    # Take the highest-scoring class per sample
    _, preds = torch.max(op, dim=1)
    return torch.tensor(precision_score(la, preds, average='weighted'))
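For example, a hypothetical call with made-up shapes (note that torch.max picks a single class per sample, so this fits a multi-class rather than a multi-label setup):

outputs = torch.randn(8, 14)         # raw scores for 8 samples, 14 classes
labels = torch.randint(0, 14, (8,))  # integer class targets, not one-hot
print(precision(outputs, labels))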
Why do we have to push the tensors to the CPU? Can't I do the calculation on the GPU?
scikit-learn metrics use numpy under the hood, which does not support tensors stored on the GPU. You should also note that this post was created 4 years ago; nowadays you might want to check e.g. torchmetrics.
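For example, a sketch assuming a recent torchmetrics version with the multilabel classification metrics, which can run directly on the GPU:

import torch
from torchmetrics.classification import MultilabelF1Score

# 14 labels as in the tensors above; the metric itself lives on the GPU
metric = MultilabelF1Score(num_labels=14, threshold=0.5).to('cuda')
preds = torch.rand(32, 14, device='cuda')             # probabilities
target = torch.randint(0, 2, (32, 14), device='cuda')
print(metric(preds, target))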