So I created the following simple function for evaluating the trained model:
import torch
from torch.utils.data import DataLoader
from torchmetrics import MeanAbsolutePercentageError, MeanSquaredError, MeanAbsoluteError, R2Score


def compute_metrics(y_pred, y_true, device):
    mape = MeanAbsolutePercentageError().to(device)(y_pred, y_true)
    mse = MeanSquaredError().to(device)(y_pred, y_true)
    mae = MeanAbsoluteError().to(device)(y_pred, y_true)
    r2 = R2Score().to(device)(y_pred, y_true)
    return mape, mse, mae, r2


def evaluate_model(model, dataset, device):
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
    with torch.no_grad():
        model.eval()
        ys = []
        y_preds = []
        for batch_idx, (X, y) in enumerate(dataloader):
            y_pred = model(X)
            ys.append(y)
            y_preds.append(y_pred)
        mape, mse, mae, r2 = compute_metrics(torch.cat(y_preds), torch.cat(ys), device)
        print(f'MAPE: {mape:.4f}, MSE: {mse:.4f}, MAE: {mae:.4f}, R2: {r2:.4f}')
The function takes in the trained model, the dataset and the device and is supposed to print the MAPE, MSE, MAE and R2 metrics.
However, I noticed that if I leave the shuffle argument in DataLoader as True, it gives me different output each time on the same model and data. Why could that be? I am accumulating all the predictions and computing the metrics on all of them, so the order shouldn’t matter.
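To make the order-invariance assumption concrete, here is a minimal sketch that checks it on synthetic data. The `mse` and `mae` helpers are hand-written stand-ins I am assuming for illustration (not the torchmetrics classes), and the random tensors are made up; the point is only that permuting predictions and targets together should leave these averages unchanged up to floating-point noise.

```python
import torch


def mse(y_pred, y_true):
    # Mean squared error over all samples (hypothetical helper).
    return torch.mean((y_pred - y_true) ** 2)


def mae(y_pred, y_true):
    # Mean absolute error over all samples (hypothetical helper).
    return torch.mean(torch.abs(y_pred - y_true))


torch.manual_seed(0)
y_true = torch.randn(1000)                 # synthetic targets
y_pred = y_true + 0.5 * torch.randn(1000)  # synthetic noisy predictions

# Apply the same permutation to predictions and targets.
perm = torch.randperm(1000)
same_mse = torch.allclose(mse(y_pred, y_true), mse(y_pred[perm], y_true[perm]))
same_mae = torch.allclose(mae(y_pred, y_true), mae(y_pred[perm], y_true[perm]))
print(same_mse, same_mae)
```

If both checks pass, shuffling alone would not explain differences in the second decimal place, as long as exactly the same set of samples is evaluated on every call.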
Example output I am getting by calling the function a few times with the same data:
MAPE: 1.0839, MSE: 0.4303, MAE: 0.4552, R2: 0.3663
MAPE: 1.0192, MSE: 0.4918, MAE: 0.4661, R2: 0.3544
MAPE: 0.9897, MSE: 0.4885, MAE: 0.4701, R2: 0.3661
The DataLoader has 145 batches, and even though I am dropping the last (incomplete) one, I don’t think that should have such a strong effect. Is there a bug in this code?
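One thing that can be checked directly is which samples each pass actually evaluates. The sketch below is only an illustration under assumptions: the dataset is a hypothetical `TensorDataset` of indices with 145 full batches plus 17 leftover samples (the real dataset size is not stated above), and `evaluated_indices` is a helper name I made up. With `shuffle=True` and `drop_last=True`, the leftover samples that get dropped are drawn from a new permutation on every pass, so each call may score a slightly different subset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def evaluated_indices(dataset, batch_size=32, shuffle=True, drop_last=True):
    """Return the set of sample indices yielded by one pass over a DataLoader."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
    seen = set()
    for (idx,) in loader:
        seen.update(idx.tolist())
    return seen


# Hypothetical size: 145 full batches of 32 plus 17 leftover samples.
n_samples = 145 * 32 + 17
dataset = TensorDataset(torch.arange(n_samples))
all_indices = set(range(n_samples))

for run in range(3):
    dropped = all_indices - evaluated_indices(dataset)
    # The number of dropped samples is fixed, but which ones are dropped can change.
    print(f"run {run}: dropped {len(dropped)} samples, first few: {sorted(dropped)[:5]}")

# With shuffle=False and drop_last=False every sample is evaluated exactly once.
kept_all = evaluated_indices(dataset, shuffle=False, drop_last=False)
print(len(kept_all) == n_samples)
```

Whether a handful of dropped samples is enough to move the metrics this much depends on the data; a metric such as MAPE divides by the target, so a few targets close to zero could plausibly dominate it. Evaluating with `shuffle=False, drop_last=False` would remove this source of variation.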