F1 score - binary vs multiclass with 2 classes

Dear all,
For once I don’t really have a question; it’s more something I want to share with you all.
I’ve spent way too much time in the field without knowing the following detail. Maybe it was obvious to most of you, but I think it’s an important point to raise, and the documentation could be clearer about it.

Let’s check the following code.

import torch
from torcheval.metrics.functional import binary_f1_score, multiclass_f1_score
pred = torch.tensor([0, 1, 1, 0, 1, 0, 1, 0, 1])
target = torch.tensor([0, 1, 1, 0, 0, 0, 0, 0, 0])
print(binary_f1_score(pred, target))   # 0.5714
print(multiclass_f1_score(pred, target, num_classes=2, average='micro'))  # 0.6667
print(multiclass_f1_score(pred, target, num_classes=2, average='macro'))  # 0.6494

from sklearn.metrics import f1_score
print(f1_score(target, pred))  # 0.5714
print(f1_score(target, pred, average='macro'))  # 0.6494
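A side note on the micro number above: when every class is included, multiclass micro-averaged F1 reduces to plain accuracy, because each misclassified sample counts as exactly one false positive (for the predicted class) and one false negative (for the true class). A minimal pure-Python sketch on the same data:

```python
# Sketch: micro-F1 over all classes equals accuracy.
pred   = [0, 1, 1, 0, 1, 0, 1, 0, 1]
target = [0, 1, 1, 0, 0, 0, 0, 0, 0]

# 6 of the 9 predictions match the target.
accuracy = sum(p == t for p, t in zip(pred, target)) / len(target)
print(round(accuracy, 4))  # 0.6667, same as the micro-averaged F1 above
```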

You can see that scikit-learn’s f1_score gives the same result as PyTorch’s binary_f1_score, because scikit-learn defaults to average='binary' (scoring only the positive class, pos_label=1), a mode that does not exist in multiclass_f1_score.
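To see what that binary mode computes, here is a minimal pure-Python sketch reproducing the 0.5714 by counting true positives, false positives, and false negatives for class 1 only:

```python
pred   = [0, 1, 1, 0, 1, 0, 1, 0, 1]
target = [0, 1, 1, 0, 0, 0, 0, 0, 0]

tp = sum(p == 1 and t == 1 for p, t in zip(pred, target))  # 2
fp = sum(p == 1 and t == 0 for p, t in zip(pred, target))  # 3
fn = sum(p == 0 and t == 1 for p, t in zip(pred, target))  # 0

precision = tp / (tp + fp)  # 2/5 = 0.4
recall = tp / (tp + fn)     # 2/2 = 1.0
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.5714, matching binary_f1_score and sklearn's default
```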

Erwan gives a great explanation in this answer: machine learning - Macro averaged in binary classification - Data Science Stack Exchange
Commonly there is a majority and a minority class, and naturally the majority class is easier to predict for the classifier. That’s why the minority class is usually chosen as the positive class: by choosing the most difficult class to predict, the performance value represents more precisely the real ability of the classifier. As a consequence, the macro-average performance is often better than the performance on the positive class, since the former includes the “easy” class.
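To make this concrete on the data above, per-class F1 can be computed by hand; f1_for is a hypothetical helper for this sketch, not part of either library:

```python
def f1_for(pos, pred, target):
    """F1 score treating `pos` as the positive class (2*TP / (2*TP + FP + FN))."""
    tp = sum(p == pos and t == pos for p, t in zip(pred, target))
    fp = sum(p == pos and t != pos for p, t in zip(pred, target))
    fn = sum(p != pos and t == pos for p, t in zip(pred, target))
    return 2 * tp / (2 * tp + fp + fn)

pred   = [0, 1, 1, 0, 1, 0, 1, 0, 1]
target = [0, 1, 1, 0, 0, 0, 0, 0, 0]

f1_class0 = f1_for(0, pred, target)  # the "easy" majority class
f1_class1 = f1_for(1, pred, target)  # the positive class, same as binary F1
macro = (f1_class0 + f1_class1) / 2
print(round(f1_class0, 4), round(f1_class1, 4), round(macro, 4))
# 0.7273 0.5714 0.6494
```

The macro score (0.6494) sits above the positive-class score (0.5714) precisely because it averages in the easier majority class.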

In conclusion, when you work with 2-class classification, be careful whether you are comparing your results against a 2-class macro-averaged F1 score or against a binary F1 score. In my case the difference was large: 0.5 vs 0.72.