Batch size of test dataloader influences model performance even in eval() mode

I have a model with 3 LSTM layers and one fully connected layer. I'm using MPS on a MacBook Air with an M2 processor.

I set the batch size of the train dataloader to 64, but I get different model evaluation results when I set the batch size of the test dataloader to 64 versus 256, even though I put the model in eval() mode.
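
Since dropout is disabled in eval() mode and there is no batch normalization, the per-sample outputs should not depend on the batch size at all. This can be checked directly (a minimal sketch; `dataset`, `model`, and `device` stand in for my own objects):

    import torch
    from torch.utils.data import DataLoader

    model.eval()
    with torch.no_grad():
        outputs = {}
        for bs in (64, 256):
            loader = DataLoader(dataset, batch_size=bs, shuffle=False)
            outputs[bs] = torch.cat([model(X.to(device)).cpu() for X, _ in loader])
        # In eval() mode the logits should be (nearly) identical for both batch sizes
        print(torch.allclose(outputs[64], outputs[256], atol=1e-5))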

Training model code:

    for epoch in range(epochs):
        model.train()
        print(f'Epoch {epoch}')
        __train_loop(model, train_dataloader, loss_function, optimizer, scheduler, verbose, device=device)

        if test_per_epoch:
            model.eval()
            train_loss, train_accuracy, train_f_score = test_model(model, train_dataloader, loss_function, device=device)

Model testing code:

    import numpy as np
    import progressbar
    import torch
    from sklearn.metrics import classification_report
    from torch.utils.data import DataLoader


    def test_model(model, test_dataloader: DataLoader, loss_function, device='cpu'):
        loss = 0
        y_pred_all = []
        y_all = []
        with progressbar.ProgressBar(max_value=len(test_dataloader)) as bar:
            with torch.no_grad():
                for batch_id, (X, y) in enumerate(test_dataloader):
                    X, y = X.to(device), y.to(device)

                    y_pred = model(X)
                    loss += loss_function(y_pred, y).item()  # y is already on the device
                    y_pred = torch.argmax(y_pred, dim=1)

                    y_pred_all.append(y_pred.cpu().numpy())
                    y_all.append(y.cpu().numpy())
                    bar.update(batch_id)

        loss /= len(test_dataloader)  # average the accumulated loss instead of resetting it to 0

        y_pred_all = np.hstack(y_pred_all)
        y_all = np.hstack(y_all)

        cr = classification_report(y_all, y_pred_all, output_dict=True)
        f_score = cr['macro avg']['f1-score']
        accuracy = cr['accuracy']

        return loss, accuracy, f_score
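
For a test set evaluation it is called, e.g., like this:

    test_loss, test_accuracy, test_f_score = test_model(model, test_dataloader, loss_function, device='mps')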

Model:

    (lstm): LSTM(3, 25, num_layers=3, batch_first=True, dropout=0.7)
    (dense): Linear(in_features=25, out_features=2, bias=True)
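
For completeness, a module with that structure looks roughly like this (a sketch; the class name is mine, and feeding the last timestep's hidden state to the dense layer is an assumption, since the forward pass isn't shown above):

    import torch.nn as nn

    class LSTMClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.lstm = nn.LSTM(3, 25, num_layers=3, batch_first=True, dropout=0.7)
            self.dense = nn.Linear(in_features=25, out_features=2, bias=True)

        def forward(self, x):
            # out has shape (batch, seq_len, 25) because batch_first=True
            out, _ = self.lstm(x)
            # Assumption: classify from the last timestep only
            return self.dense(out[:, -1, :])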

For the loss function I use torch.nn.CrossEntropyLoss(), and for the optimizer torch.optim.Adam().

The learning rate is 0.001 with an ExponentialLR(gamma=0.95) scheduler.
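
Put together, the setup looks like this (a minimal sketch with my own variable names):

    import torch

    loss_function = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)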

Model performance after the 6th epoch with a test dataloader batch size of 64:

    Train accuracy: 0.9087542087542088
    Train F-Score: 0.908342717859852

    Test accuracy: 0.8906356801093643
    Test F-Score: 0.8901527949844201

Model performance after the 6th epoch with a test dataloader batch size of 256:

    Train accuracy: 0.8942760942760942
    Train F-Score: 0.8939527432555182

    Test accuracy: 0.5974025974025974
    Test F-Score: 0.5557444197838286

There is no such effect on Nvidia CUDA graphics cards, which is why I created this topic in the MPS category. Why does it behave like this? Thanks a lot in advance for your answer!
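
To check whether the MPS backend itself is responsible, the same batch can be run through the model on both CPU and MPS and the outputs compared (a sketch; `X` stands for a single batch taken from the test dataloader):

    model.eval()
    with torch.no_grad():
        out_cpu = model.to('cpu')(X.to('cpu'))
        out_mps = model.to('mps')(X.to('mps')).cpu()
        # A large difference here would point at the MPS LSTM kernels
        print((out_mps - out_cpu).abs().max())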

Batch normalization layers (if any) might exhibit different behavior depending on batch size due to their internal running statistics.

Do you have any batch normalization layers?
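
You can list any normalization layers with something like:

    import torch.nn as nn

    # Print any batch-norm modules hiding in the model
    for name, module in model.named_modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            print(name, module)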

No, my model has only LSTM and Linear layers.