Calculating F1 score over batched data

I have a multi-label problem where I need to calculate the F1 score. I am currently using scikit-learn's f1_score with average='samples'.

Is it correct that I need to add up the F1 score for each batch (multiplied by the batch size, as in the code below) and then divide by the length of the dataset to get the correct value? Currently I am getting an F1 score of 40%, which seems too high considering my imbalanced dataset.

  • My data is multi-label; an example target would be [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

  • I am using BCEWithLogitsLoss and apply a sigmoid to the output, followed by a threshold, to get the comparable prediction (predicted in the code below).
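
For readers unfamiliar with the sigmoid-plus-threshold step, here is a minimal sketch of it. It uses NumPy as a stand-in for the tensor operations, and the logits and the 0.5 threshold are made-up values for illustration, not taken from the original model.

```python
import numpy as np

# Hypothetical raw model outputs (logits) for 2 samples and 4 labels.
logits = np.array([[2.0, -1.0, 0.3, -3.0],
                   [-0.5, 1.5, -2.0, 0.1]])

threshold = 0.5  # assumed cut-off; it would normally be tuned on validation data

# Sigmoid maps each logit to an independent per-label probability.
probs = 1.0 / (1.0 + np.exp(-logits))

# A label counts as predicted when its probability exceeds the threshold.
predicted = (probs > threshold).astype(int)
print(predicted)
```

Each row of predicted is then a 0/1 indicator vector with the same layout as the target, which is the format f1_score expects for multi-label input.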

Code Example

for epoch in range(500):
    running_f1 = 0

    for i, batch in enumerate(custom_train_loader):
        # ...training / loss etc
        running_f1 += f1_score(labels.cpu().int().numpy(), predicted.cpu().int().numpy(), average='samples') * batch_size

    # note: the divisor should be the size of the dataset that the loader above iterates over
    epoch_f1 = running_f1 / len(val_dataset)
   

I don’t think you can simply average the per-batch F1 scores, as shown in this small dummy example:

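import numpy as np
from sklearn.metrics import f1_score

preds = np.random.randint(0, 2, (100,))
targets = np.random.randint(0, 2, (100,))

# Reference: F1 computed once over all samples
f1_ref = f1_score(targets, preds)

# Naive alternative: average the per-batch F1 scores
f1_running = 0
batch_size = 10
num_batches = 0
for i in range(0, preds.shape[0], batch_size):
    pred = preds[i:i+batch_size]
    target = targets[i:i+batch_size]
    f1_running += f1_score(target, pred)  # y_true first, then y_pred
    num_batches += 1

f1_running /= num_batches  # divide by the number of batches, not the batch size

print(f1_ref, f1_running)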
> 0.4444444444444445 0.423989898989899
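
The reason the two numbers differ is that binary F1 is a ratio of counts, 2*TP / (2*TP + FP + FN), rather than a mean over samples. As an alternative that is not from this thread, you could accumulate the raw counts per batch and compute F1 once at the end. The sketch below uses NumPy only; binary_counts and f1_from_counts are hypothetical helper names, and returning 0.0 for an empty denominator is an assumption.

```python
import numpy as np

def binary_counts(y_true, y_pred):
    """Return (tp, fp, fn) for binary 0/1 arrays."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 when the denominator is zero (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

rng = np.random.default_rng(0)
preds = rng.integers(0, 2, 100)
targets = rng.integers(0, 2, 100)

# Reference value over the full arrays.
f1_full = f1_from_counts(*binary_counts(targets, preds))

tp = fp = fn = 0
batch_f1s = []
batch_size = 10
for i in range(0, len(preds), batch_size):
    b_tp, b_fp, b_fn = binary_counts(targets[i:i+batch_size], preds[i:i+batch_size])
    tp, fp, fn = tp + b_tp, fp + b_fp, fn + b_fn           # running counts
    batch_f1s.append(f1_from_counts(b_tp, b_fp, b_fn))      # per-batch score

f1_accumulated = f1_from_counts(tp, fp, fn)   # matches f1_full
f1_batch_mean = float(np.mean(batch_f1s))     # generally does not
print(f1_full, f1_accumulated, f1_batch_mean)
```

Accumulating counts avoids keeping every prediction in memory, at the cost of being tied to one averaging mode.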

You could append the current labels and predicted arrays to separate lists, create arrays from them, and calculate the F1 score after the epoch is done.

Here is my updated method, as suggested. Thanks @ptrblck.

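for epoch in range(500):
    targets = []
    outputs = []

    for i, batch in enumerate(custom_train_loader):
        # ...training / loss etc; `predicted` is the thresholded sigmoid output
        outputs.append(predicted.cpu().int().numpy())
        targets.append(labels.cpu().int().numpy())

    # Stack the per-batch arrays into one (num_samples, num_labels) array each
    outputs = np.concatenate(outputs)
    targets = np.concatenate(targets)

    # y_true first, then y_pred
    f1 = f1_score(targets, outputs, average='samples')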
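
One note on the original question: average='samples' is itself a mean of per-row F1 scores. So, unlike the binary case in the dummy example, weighting each batch score by its actual size and dividing by the total number of rows should reproduce the full-data value. The sketch below checks this with NumPy only. samples_f1 is a hypothetical helper that mimics the per-row definition, and scoring rows with no true and no predicted labels as 0.0 is an assumption.

```python
import numpy as np

def samples_f1(y_true, y_pred):
    """Mean of per-row F1 for 0/1 indicator matrices (hypothetical helper).
    Rows with no true and no predicted labels score 0.0 (assumption)."""
    tp = np.sum(y_true * y_pred, axis=1)
    denom = np.sum(y_true, axis=1) + np.sum(y_pred, axis=1)
    per_row = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return float(per_row.mean())

rng = np.random.default_rng(0)
targets = rng.integers(0, 2, (105, 14))   # 105 rows, so the last batch is smaller
preds = rng.integers(0, 2, (105, 14))

f1_full = samples_f1(targets, preds)

running_f1 = 0.0
batch_size = 10
for i in range(0, len(targets), batch_size):
    t, p = targets[i:i+batch_size], preds[i:i+batch_size]
    running_f1 += samples_f1(t, p) * len(t)   # weight by the actual batch size
epoch_f1 = running_f1 / len(targets)

print(f1_full, epoch_f1)
```

The two values agree only if the weight is the real size of each batch (the last one may be smaller) and the divisor is the number of rows actually iterated. Collecting everything and scoring once, as above, sidesteps both pitfalls.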