Why does shuffling the DataLoader change model accuracy even though the model is in eval mode?

So I created the following simple function for evaluating the trained model:

import torch
from torch.utils.data import DataLoader
from torchmetrics import MeanAbsolutePercentageError, MeanSquaredError, MeanAbsoluteError, R2Score

def compute_metrics(y_pred, y_true, device):
    mape = MeanAbsolutePercentageError().to(device)(y_pred, y_true)
    mse = MeanSquaredError().to(device)(y_pred, y_true)
    mae = MeanAbsoluteError().to(device)(y_pred, y_true)
    r2 = R2Score().to(device)(y_pred, y_true)
    return mape, mse, mae, r2


def evaluate_model(model, dataset, device):
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
    with torch.no_grad():
        model.eval()
        ys = []
        y_preds = []
        for X, y in dataloader:
            y_pred = model(X)
            ys.append(y)
            y_preds.append(y_pred)
        mape, mse, mae, r2 = compute_metrics(torch.cat(y_preds), torch.cat(ys), device)
        print(f'MAPE: {mape:.4f}, MSE: {mse:.4f}, MAE: {mae:.4f}, R2: {r2:.4f}')

The function takes the trained model, the dataset and the device, and is supposed to print the MAPE, MSE, MAE and R2 metrics.
However, I noticed that if I leave the shuffle argument of the DataLoader set to True, it gives me a different output each time on the same model and data. Why could that be? I accumulate all the predictions and compute the metrics on all of them, so the order shouldn't matter.
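
To rule out the ordering itself, here is a minimal sketch with random tensors (the sizes are just placeholders) showing that permuting the accumulated predictions and targets leaves the metric unchanged:

import torch
from torchmetrics import MeanSquaredError

torch.manual_seed(0)
y_true = torch.randn(4640)                 # 145 full batches of 32 samples
y_pred = y_true + 0.1 * torch.randn(4640)  # fake "predictions"

perm = torch.randperm(len(y_true))
mse = MeanSquaredError()
print(mse(y_pred, y_true))                 # metric on the original order
mse.reset()
print(mse(y_pred[perm], y_true[perm]))     # same value after permuting both tensors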
Example output I get from calling the function a few times with the same data:

MAPE: 1.0839, MSE: 0.4303, MAE: 0.4552, R2: 0.3663
MAPE: 1.0192, MSE: 0.4918, MAE: 0.4661, R2: 0.3544
MAPE: 0.9897, MSE: 0.4885, MAE: 0.4701, R2: 0.3661

The DataLoader has 145 batches, and even though I am dropping the last (incomplete) one, I wouldn't expect such a strong effect. Is there some bug in this code?

Could you keep the last batch (drop_last=False) and check how large the difference is? With shuffle=True and drop_last=True the incomplete last batch contains different samples on every call, so each evaluation runs on a slightly different subset of the data.
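
A minimal sketch with a toy TensorDataset (just to illustrate, not your data) that shows which samples get silently dropped on repeated runs:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset whose "features" are just the sample indices, so we can see what is kept.
data = torch.arange(100).float().unsqueeze(1)
dataset = TensorDataset(data, data)

for run in range(3):
    loader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
    seen = torch.cat([X.squeeze(1) for X, _ in loader]).long()
    dropped = sorted(set(range(100)) - set(seen.tolist()))
    # 100 % 32 = 4 samples are dropped, and they differ between runs because of the shuffle
    print(f"run {run}: kept {len(seen)} samples, dropped indices {dropped}")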

Thanks, that helped. The variation really was due to the changing samples; I hadn't realized that dropping the last batch no longer makes sense during inference, and keeping it solved the problem.
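
For reference, a sketch of the adjusted loop (drop_last=False keeps every sample; shuffle=False is not required for correctness, but there is no reason to shuffle during evaluation):

from torch.utils.data import DataLoader

def evaluate_model(model, dataset, device):
    # Keep every sample (drop_last=False) and a fixed order (shuffle=False):
    # repeated calls now evaluate exactly the same data, so the metrics are deterministic.
    dataloader = DataLoader(dataset, batch_size=32, shuffle=False, drop_last=False)
    ...  # rest of the function unchanged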