# Calculating F1 score over batched data

I have a multi-label problem where I need to calculate the F1 metric, currently using scikit-learn's `f1_score` with `average='samples'`.

Is it correct that I need to add the F1 score for each batch and then divide by the length of the dataset to get the correct value? Currently I am getting an F1 score of 40%, which seems too high considering my imbalanced dataset.

• My data is multi-label; an example target would be `[0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1]`

• I am using `BCEWithLogitsLoss` and apply a sigmoid with a threshold to the output to get the comparable prediction (`predicted`) in the code below.
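For context, the sigmoid-plus-threshold step described above can be sketched like this (the shapes and the 0.5 threshold are assumptions, not taken from the original code):

``````python
import torch

# Hypothetical raw model outputs for a batch of 2 samples and
# 14 labels, matching the example target format above.
logits = torch.randn(2, 14)

# BCEWithLogitsLoss consumes raw logits; for the prediction we
# apply a sigmoid and threshold the probabilities (0.5 assumed).
probs = torch.sigmoid(logits)
predicted = (probs > 0.5).int()

print(predicted.shape)  # torch.Size([2, 14]), entries are 0 or 1
``````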

Code Example

``````
for epoch in range(500):
    running_f1 = 0

    # ...training / loss etc.
    running_f1 += f1_score(labels.cpu().int().numpy(),
                           predicted.cpu().int().numpy(),
                           average='samples') * batch_size

    epoch_f1 = running_f1 / len(val_dataset)
``````

I don’t think you can simply average the per-batch F1 scores, as this small dummy example shows:

``````
import numpy as np
from sklearn.metrics import f1_score

preds = np.random.randint(0, 2, (100,))
targets = np.random.randint(0, 2, (100,))

f1_ref = f1_score(targets, preds)

f1_running = 0
batch_size = 10
for i in range(0, len(preds), batch_size):
    pred = preds[i:i+batch_size]
    target = targets[i:i+batch_size]
    f1_running += f1_score(target, pred)

# average over the number of batches
f1_running /= len(preds) // batch_size

print(f1_ref, f1_running)
> 0.4444444444444445 0.423989898989899
``````

You could append the current `labels` and `predicted` arrays to separate lists, create arrays from them, and calculate the F1 score after the epoch is done.
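A minimal sketch of that accumulate-then-compute pattern, using random arrays to stand in for a validation loop (the batch count, shapes, and data here are invented for illustration):

``````python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

all_targets = []
all_preds = []

# Stand-in for the validation loop: 10 batches of 10 samples
# with 14 labels each, mirroring the example target above.
for _ in range(10):
    all_targets.append(rng.integers(0, 2, (10, 14)))
    all_preds.append(rng.integers(0, 2, (10, 14)))

# Build full arrays once the epoch is done...
targets = np.concatenate(all_targets)
preds = np.concatenate(all_preds)

# ...and compute the F1 score a single time over all samples.
epoch_f1 = f1_score(targets, preds, average='samples')
print(epoch_f1)
``````

Computing the metric once over the concatenated arrays avoids the batch-averaging bias demonstrated in the dummy example above.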


As suggested, my updated method. Thanks @ptrblck.

``````for epoch in range(500):
targets = []
outputs = []