I have a tensor containing the ground truth labels, which are one-hot encoded. My predicted tensor has the probabilities for each class. In this case, how can I calculate the precision, recall and F1 score for multi-label classification in PyTorch?
Precision, recall and F1 score are defined for a binary classification task.
Usually you would have to treat your data as a collection of multiple binary problems to calculate these metrics.
The multi-label metric will then be calculated using an averaging strategy, e.g. macro or micro averaging.
You could use the scikit-learn metrics to calculate these metrics.
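For example, a minimal sketch (the toy labels are made up for illustration) showing how the averaging strategy changes the score:

import numpy as np
from sklearn.metrics import f1_score

# Toy multilabel data: 4 samples, 3 labels
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

# micro: pool TP/FP/FN over all labels, then compute F1 once
print(f1_score(y_true, y_pred, average='micro'))
# macro: compute F1 per label, then take the unweighted mean
print(f1_score(y_true, y_pred, average='macro'))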
I am using the scikit-learn metrics for this and used this code:
print('F1: {}'.format(f1_score(outGT, outPRED, average="samples")))
print('Precision: {}'.format(precision_score(outGT, outPRED, average="samples")))
print('Recall: {}'.format(recall_score(outGT, outPRED, average="samples")))
This is throwing this error:
ValueError: Classification metrics can't handle a mix of multilabel-indicator and continuous-multioutput targets
The output of
print('Ground Truth: {}'.format(outGT))
print('Predicted Truth: {}'.format(outPRED))
is as below:
Ground Truth:
0 0 0 ... 0 0 0
0 0 0 ... 0 0 0
0 0 0 ... 0 0 0
... ⋱ ...
1 0 0 ... 0 0 0
1 0 0 ... 0 0 0
0 0 0 ... 0 0 0
[torch.cuda.FloatTensor of size 22433x14 (GPU 0)]
Predicted Truth:
0.0901 0.0916 0.0389 ... 0.0021 0.0078 0.0016
0.0424 0.0084 0.0111 ... 0.0053 0.0079 0.0025
0.0611 0.0205 0.0206 ... 0.0024 0.0074 0.0018
... ⋱ ...
0.3588 0.0223 0.1421 ... 0.0036 0.0094 0.0035
0.1782 0.0226 0.2275 ... 0.0033 0.0129 0.0016
0.2574 0.0176 0.2255 ... 0.0034 0.0118 0.0023
[torch.cuda.FloatTensor of size 22433x14 (GPU 0)]
Try to use a threshold on your predictions so that they indicate a predicted label.
This should work:
f1_score(outGT, outPRED > 0.5, average="samples")
EDIT: Also, you might want to push the tensors to CPU first.
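Putting both together, a sketch using the tensors from your post (outGT and outPRED are the CUDA tensors shown above):

from sklearn.metrics import f1_score, precision_score, recall_score

# Threshold the probabilities to get hard 0/1 label predictions,
# then move everything to the CPU for the numpy-based sklearn metrics
gt = outGT.cpu().numpy()
pred = (outPRED > 0.5).cpu().numpy()

print('F1: {}'.format(f1_score(gt, pred, average='samples')))
print('Precision: {}'.format(precision_score(gt, pred, average='samples')))
print('Recall: {}'.format(recall_score(gt, pred, average='samples')))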
Thanks a lot. The solution makes a lot of sense. I have to apply a threshold before trying to calculate the score.
@ptrblck
I am also working on a multi-label classification task where the ground truth labels are one-hot encoded. I get the predicted values for the samples and the loss computes properly. But when I try to compute the accuracy as you suggested in the post, I still get the error "ValueError: Classification metrics can't handle a mix of unknown and multilabel-indicator targets".
print('F1: {}'.format(f1_score(labels.data.to('cpu'), outputs.data.to('cpu') > 0.5, average="samples")))
I wrote the function in PyTorch in an attempt to train with F1 loss. https://gist.github.com/SuperShinyEyes/dcc68a08ff8b615442e3bc6a9b55a354
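The core idea is a soft F1 computed from probabilities so it stays differentiable; a minimal sketch of that idea (not the gist verbatim):

import torch

def f1_loss(y_pred, y_true, eps=1e-7):
    # y_pred: probabilities in [0, 1], y_true: 0/1 targets, both of shape (N, C)
    # Soft counts keep the expression differentiable
    tp = (y_pred * y_true).sum(dim=0)
    fp = (y_pred * (1 - y_true)).sum(dim=0)
    fn = ((1 - y_pred) * y_true).sum(dim=0)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    # Macro-average over labels; return 1 - F1 so minimizing improves F1
    return 1 - f1.mean()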
Usually this has worked for me:
from sklearn.metrics import precision_score
import torch

def precision(outputs, labels):
    # Move both tensors to the CPU so sklearn can convert them to numpy
    op = outputs.cpu()
    la = labels.cpu()
    # Take the highest-scoring class per sample
    _, preds = torch.max(op, dim=1)
    return torch.tensor(precision_score(la, preds, average='weighted'))
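For example, a hypothetical call with made-up shapes (note that torch.max picks a single class per sample, so this fits a multi-class rather than a multi-label setup):

outputs = torch.randn(8, 14)         # raw scores for 8 samples, 14 classes
labels = torch.randint(0, 14, (8,))  # integer class targets, not one-hot
print(precision(outputs, labels))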
Why do we have to push the tensors to the CPU? Can't I do the calculation on the GPU?
scikit-learn metrics use numpy under the hood, which does not support tensors stored on the GPU. You should also note that this post was created 4 years ago; nowadays you might want to check e.g. torchmetrics.
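For example, a sketch assuming a recent torchmetrics version with the multilabel classification metrics, which can run directly on the GPU:

import torch
from torchmetrics.classification import MultilabelF1Score

# 14 labels as in the tensors above; the metric itself lives on the GPU
metric = MultilabelF1Score(num_labels=14, threshold=0.5).to('cuda')
preds = torch.rand(32, 14, device='cuda')             # probabilities
target = torch.randint(0, 2, (32, 14), device='cuda')
print(metric(preds, target))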