F1 score in PyTorch for evaluating BERT

I have created a function for evaluating a model. It takes the model and the validation data loader as input and returns the validation loss, validation accuracy, and weighted F1 score.

def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's performance
    on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []
    f1_weighted = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

        # Calculate the f1 weighted score
        f1_metric = F1Score('weighted') 
        f1_weighted = f1_metric(preds, b_labels)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)
    f1_weighted = np.mean(f1_weighted)

    return val_loss, val_accuracy, f1_weighted 

The code for the F1 score can be found here
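Since the linked implementation is not reproduced in this post, here is only a rough sketch of what a weighted F1 score computes, written in plain Python. It is not the F1Score class used above; the function name weighted_f1 and the sample data are made up for illustration.

```python
from collections import Counter


def weighted_f1(preds, labels):
    """Average the per-class F1 scores, weighted by each class's support."""
    support = Counter(labels)  # number of true samples per class
    total = len(labels)
    score = 0.0
    for cls, n_true in support.items():
        # True positives: predicted cls and actually cls
        tp = sum(1 for p, y in zip(preds, labels) if p == cls and y == cls)
        n_pred = sum(1 for p in preds if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Weight the class F1 by its share of the true labels
        score += f1 * n_true / total
    return score


# Hypothetical predictions and labels for a two-class problem
preds = [0, 0, 1, 1]
labels = [0, 1, 1, 1]
print(weighted_f1(preds, labels))
```

Note that a score computed once over all validation predictions is generally not the same as the mean of per-batch scores, so averaging batch values gives only an approximation.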

Before the evaluation function, there is a function that trains a BERT model and takes the following inputs:

train(model, train_dataloader, val_dataloader, epochs, evaluation).

Thus, if evaluation=True, the validation accuracy is reported at the end of each epoch.

The dataloaders are created in the following way:

# Convert other data types to torch.Tensor
train_labels = torch.tensor(authors_train)

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

You can create the dataloaders for the validation and test sets in a similar way.
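As a minimal sketch of the validation loader, assuming tensors named val_inputs and val_masks and a label list authors_val exist (these names, the batch size, and the dummy data below are hypothetical), a SequentialSampler can be used, since shuffling is not needed for evaluation:

```python
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

batch_size = 2  # assumed value for this example

# Dummy stand-ins for the tokenized validation data (hypothetical names)
val_inputs = torch.zeros((5, 8), dtype=torch.long)  # token ids
val_masks = torch.ones((5, 8), dtype=torch.long)    # attention masks
authors_val = [0, 1, 0, 1, 1]                       # class labels

# Convert other data types to torch.Tensor
val_labels = torch.tensor(authors_val)

# Create the DataLoader for the validation set; samples are read in order
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
```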

I replaced the line

f1_weighted = f1_metric(preds, b_labels)

with this one

f1_weighted.append(f1_metric(preds, b_labels))

and now I get the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-49-0e0f6d227c4f> in <module>()
      1 set_seed(42)    # Set seed for reproducibility
      2 bert_classifier, optimizer, scheduler = initialize_model(epochs=4)
----> 3 train(bert_classifier, train_dataloader, val_dataloader, epochs=4, evaluation=True)
      5 #1. 77.28

3 frames
<__array_function__ internals> in mean(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
    168             ret = arr.dtype.type(ret / rcount)
    169         else:
--> 170             ret = ret.dtype.type(ret / rcount)
    171     else:
    172         ret = ret / rcount

AttributeError: 'torch.dtype' object has no attribute 'type'

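Based on the error message, it seems you are calling np.mean on a tensor somewhere, which fails at NumPy's internal .dtype.type check. Given the change to f1_weighted.append(...), a likely candidate is np.mean(f1_weighted), since the list presumably now holds tensors rather than Python numbers.
Try to narrow down where this mean call is coming from and what its input arguments are.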
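One possible way to avoid handing tensors to np.mean is to convert each per-batch score to a plain Python float before appending it. The sketch below assumes the metric returns a single-element tensor; the helper to_float and the dummy scores are hypothetical and only stand in for f1_metric(preds, b_labels).

```python
import numpy as np
import torch


def to_float(value):
    """Return a plain Python float for a single-element tensor or a number."""
    if torch.is_tensor(value):
        # .item() copies a single-element tensor (CPU or GPU) into a Python number
        return value.item()
    return float(value)


# Hypothetical per-batch scores, standing in for f1_metric(preds, b_labels)
batch_scores = [torch.tensor(0.5), torch.tensor(1.0)]

f1_weighted = []
for score in batch_scores:
    f1_weighted.append(to_float(score))

# The list now holds Python floats, so np.mean works as usual
mean_f1 = np.mean(f1_weighted)
print(type(f1_weighted[0]).__name__, mean_f1)
```

In the evaluate function this would amount to appending f1_metric(preds, b_labels).item(), provided the metric really returns a single-element tensor.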