Calculating F1 score over batched data

I have a multi-label problem where I need to calculate the F1 metric, currently using sklearn.metrics.f1_score with average='samples'.

Is it correct that I need to sum the F1 score for each batch and then divide by the length of the dataset to get the value for the epoch? Currently I am getting an F1 score of about 40%, which seems too high considering my imbalanced dataset.

  • My data is multi-label; an example target would be [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

  • I am using BCEWithLogitsLoss and apply a sigmoid with a threshold to the output to get the comparable prediction (predicted in the code below).
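To make the second bullet concrete, here is a minimal sketch of the sigmoid-plus-threshold step, using made-up logits and an assumed cutoff of 0.5 (plain NumPy stands in for the model output):

```python
import numpy as np

# Hypothetical raw model outputs (logits) for one sample with 14 labels,
# matching the example target above. BCEWithLogitsLoss consumes raw logits;
# the sigmoid/threshold is only needed to build the binary prediction vector.
logits = np.array([-2.1, -3.0, -1.2, -0.5, 1.8, -2.2, 0.9,
                   -1.1, -0.7, -3.3, 2.4, -0.8, 1.1, 0.6])

# Sigmoid squashes each logit into a probability; thresholding at 0.5
# (an assumed cutoff) yields the multi-hot prediction.
probs = 1.0 / (1.0 + np.exp(-logits))
predicted = (probs > 0.5).astype(int)

print(predicted)  # [0 0 0 0 1 0 1 0 0 0 1 0 1 1]
```

With these logits the prediction happens to match the example target exactly; in practice the threshold is a tunable hyperparameter.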

Code Example

for epoch in range(500):
    running_f1 = 0
    for i, batch in enumerate(custom_train_loader):
        # ... forward pass, loss, predictions etc.
        running_f1 += f1_score(labels.cpu().int().numpy(),
                               predicted.cpu().int().numpy(),
                               average='samples') * batch_size

    epoch_f1 = running_f1 / len(custom_train_loader.dataset)

I don’t think you can simply calculate the average of the F1 score, as shown in this small dummy example:

import numpy as np
from sklearn.metrics import f1_score

preds = np.random.randint(0, 2, (100,))
targets = np.random.randint(0, 2, (100,))

# reference: F1 computed once over the whole set
f1_ref = f1_score(targets, preds)

# naive approach: average the per-batch F1 scores
f1_running = 0
batch_size = 10
for i in range(0, preds.shape[0], batch_size):
    pred = preds[i:i+batch_size]
    target = targets[i:i+batch_size]
    f1_running += f1_score(target, pred)

f1_running /= preds.shape[0] // batch_size  # divide by the number of batches

print(f1_ref, f1_running)
> 0.4444444444444445 0.423989898989899

You could append the current labels and predicted arrays to separate lists, create arrays from them, and calculate the F1 score after the epoch is done.


As suggested, here is my updated method. Thanks @ptrblck.

for epoch in range(500):
    targets = []
    outputs = []

    for batch in custom_train_loader:
        # ... forward pass, loss etc., then store this batch's results
        targets.append(labels.cpu().int().numpy())
        outputs.append(predicted.cpu().int().numpy())

    outputs = np.concatenate(outputs)
    targets = np.concatenate(targets)

    f1 = f1_score(targets, outputs, average='samples')