Apparent side effects of DataLoader on model training

I am analysing the test loss landscape and am trying to implement the following basic scheme.

Within each epoch:

  1. Calculate the weight update based on the full training set
  2. Measure some metrics relating to the test loss function (this uses the weight update from step 1, but shouldn’t alter it)
  3. Perform the weight update using the values found in step 1.
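The scheme can be sketched as follows. This is only a minimal illustration, assuming a toy linear model and random tensors in place of the real network and MNIST; `run`, `flat_grad` and the directional-derivative metric are hypothetical names chosen for the example. Because `torch.autograd.grad` returns gradients without writing to the `.grad` buffers, step 2 here cannot alter the update computed in step 1:

```python
import torch
import torch.nn as nn


def flat_grad(loss, model):
    # Gradient of `loss` w.r.t. all parameters as one flat vector.
    # torch.autograd.grad returns the gradients without writing to p.grad.
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.flatten() for g in grads])


def run(measure_test_metrics, epochs=3, lr=0.1):
    torch.manual_seed(0)  # same toy data and initial weights on every call
    model = nn.Linear(4, 2)
    x_train, y_train = torch.randn(32, 4), torch.randn(32, 2)
    x_test, y_test = torch.randn(16, 4), torch.randn(16, 2)
    loss_fn = nn.MSELoss()
    metrics = []
    for _ in range(epochs):
        # Step 1: weight update from the full training set
        update = -lr * flat_grad(loss_fn(model(x_train), y_train), model)
        # Step 2: read-only test metric (directional derivative of the
        # test loss along the update)
        if measure_test_metrics:
            test_grad = flat_grad(loss_fn(model(x_test), y_test), model)
            metrics.append(torch.dot(test_grad, update).item())
        # Step 3: apply the update found in step 1
        with torch.no_grad():
            params = nn.utils.parameters_to_vector(model.parameters())
            nn.utils.vector_to_parameters(params + update, model.parameters())
    final = nn.utils.parameters_to_vector(model.parameters()).detach().clone()
    return final, metrics
```

In this toy setting, `run(True)` and `run(False)` should end with identical weights, which is the behaviour expected of the real loop.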

The problem is that something in step 2 makes the model train differently from when step 2 is skipped altogether. Step 2 is meant to be read-only, so it should have no effect on training.

I provide a shortened reproduction below. The model being trained is a fully connected NN with 1 hidden layer, and I’m using a subset of MNIST as a toy dataset.


import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset, TensorDataset
from torchvision import datasets, transforms
from torch.nn import MSELoss
import numpy as np


class DenseNN(nn.Module):
    """Fully connected neural network"""
    def __init__(self, num_hidden_units):
        super(DenseNN, self).__init__()
        self.num_hidden_units = num_hidden_units
        self.l1 = nn.Linear(784, num_hidden_units)
        self.activation_fun = nn.ReLU()
        self.l2 = nn.Linear(num_hidden_units, 10)

    def forward(self, x):
        return self.l2(self.activation_fun(self.l1(x)))


def cat_and_flatten(input: tuple[torch.Tensor]):
    # Flatten each tensor and concatenate them into a single 1-D vector
    return torch.cat([torch.flatten(i) for i in input])

def one_hot_transform(y):
    """
    Transform to convert class labels to one-hot representation
        y: [list : int]
    return (tensor, n x num_classes)
    """
    return torch.zeros(10, dtype=torch.float).scatter_(0, torch.tensor(y), value=1)

def load_data_to_device(train_dataset, test_dataset, args):
    """
    Given a suitably sized dataset, transfer to device before training and return related DataLoader objects.
    As MNIST is small, saves data transfer on every train/test iteration.
    return: (DataLoader, DataLoader)
    """
    # Load each dataset as one full batch, move it to the device, then re-wrap it
    train_loader = DataLoader(train_dataset, batch_size=len(train_dataset), shuffle=True)
    images, labels = next(iter(train_loader))
    images, labels = images.to(args["device"]), labels.to(args["device"])
    train_loader = DataLoader(TensorDataset(images, labels), batch_size=args["batch_size"], shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=len(test_dataset), shuffle=False)
    images, labels = next(iter(test_loader))
    images, labels = images.to(args["device"]), labels.to(args["device"])
    test_loader = DataLoader(TensorDataset(images, labels), batch_size=args["test_batch_size"], shuffle=True)
    return train_loader, test_loader

def get_data(args) -> tuple[DataLoader, DataLoader]:
    """Get data loaders for train and test data"""
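    data_rng = np.random.RandomState(args["data_seed"])
    # Assumed: the original `transform` definition was lost. DenseNN expects
    # flat 784-dimensional inputs, so convert to a tensor and flatten.
    transform = transforms.Compose([transforms.ToTensor(), transforms.Lambda(torch.flatten)])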
    dataset1 = datasets.MNIST('./mnist_data', train=True, download=True,
                    transform=transform, target_transform=transforms.Compose([one_hot_transform]))
    dataset2 = datasets.MNIST('./mnist_data', train=False, download=True,
                    transform=transform, target_transform=transforms.Compose([one_hot_transform]))
    dataset1 = Subset(dataset1, data_rng.choice(len(dataset1), args["train_size"], replace=False))
    if not args["pre_transfer"]:
        # For use if not transferring to GPU before training loop
        train_loader = DataLoader(dataset1, batch_size=args["batch_size"], shuffle=True)
        test_loader = DataLoader(dataset2, batch_size=args["test_batch_size"])
    else:
        # Transfer whole datasets to device (i.e. GPU) before training (Faster)
        train_loader, test_loader = load_data_to_device(dataset1, dataset2, args)
    return train_loader, test_loader

args = {
    "seed": 1,
    "data_seed": 1,
    "train_size": 4000,
    "batch_size": 64,
    "test_batch_size": 1000,
    "pre_transfer": True,
    "epochs": 11,
    "lr": 0.001,
    "loss_fn": "mse",
}

experiment_log = {"args": args}
if torch.cuda.is_available():
    args["device"] = torch.device("cuda")
else:
    args["device"] = torch.device("cpu")
# Assumed: seed the global RNG so that every run starts from identical weights
torch.manual_seed(args["seed"])

train_loader, test_loader = get_data(args)
model = DenseNN(20).to(args["device"])
loss_fn = MSELoss(reduction="sum")

for epoch in range(args["epochs"]):
    # Get weight update direction using the full batch of training data
    model.zero_grad()  # clear gradients left over from the previous epoch
    epoch_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        if not args["pre_transfer"]:
            data, target = data.to(args["device"]), target.to(args["device"])
        output = model(data)
        loss = loss_fn(output, target) / len(data)
        loss.backward()  # accumulate gradients over all batches
        epoch_loss += loss.detach()
    weight_update = (-args['lr'] * cat_and_flatten([p.grad for p in model.parameters()])).clone()

    # Get landscape metrics of test loss function
    for data, target in test_loader:
        pass  # metric calculations omitted; the bare loop is enough to reproduce

    # Apply weight update to model
    with torch.no_grad():
        model_params = nn.utils.parameters_to_vector(model.parameters())
        model_params += weight_update
        nn.utils.vector_to_parameters(model_params, model.parameters())

    print(f"epoch: {epoch}, train: {epoch_loss.item():.16f}")


Training appears to be affected when the below piece of code is included:

for data, target in test_loader:
    pass

In the full implementation I calculate metrics relating to the Jacobian and Hessian of the test loss function. Hence I loop over the test dataset and use autograd. I thought the bug would be something related to the calculation of gradients, but it appears that simply looping over the test_loader is causing some issue.

When the test_loader loop is included I obtain the results below:

epoch: 0, train: 70.4589233398437500
epoch: 1, train: 61.0079269409179688
epoch: 2, train: 59.3085861206054688
epoch: 3, train: 57.9784698486328125
epoch: 4, train: 56.7398986816406250
epoch: 5, train: 55.5860328674316406
epoch: 6, train: 54.3812026977539062
epoch: 7, train: 53.2106170654296875
epoch: 8, train: 52.1317024230957031
epoch: 9, train: 51.1557388305664062
epoch: 10, train: 50.2027473449707031

When the test_loader loop is commented out I obtain:

epoch: 0, train: 70.4589233398437500
epoch: 1, train: 60.9965400695800781
epoch: 2, train: 59.3392562866210938
epoch: 3, train: 57.9400787353515625
epoch: 4, train: 56.7479476928710938
epoch: 5, train: 55.6033439636230469
epoch: 6, train: 54.3827400207519531
epoch: 7, train: 53.2284889221191406
epoch: 8, train: 52.1244354248046875
epoch: 9, train: 51.1372718811035156
epoch: 10, train: 50.2136306762695312

As can be seen, training differs after the first epoch.

I hope I haven’t missed something trivial, but with my current knowledge I’m running out of ideas for what could be causing this.

Let me know if there is anything that needs clarification.

Many thanks!

You might be running into this issue.

Yes, it was to do with the random state. Many thanks!
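For anyone landing here later, a minimal sketch of the effect. It assumes the cause is that a shuffling `DataLoader` without its own `generator` draws from PyTorch's global RNG each time it is iterated; the helper `train_orders` and the toy datasets are hypothetical and not taken from the code above. Iterating the second loader advances the global RNG, so the training shuffle order changes from the second epoch on, whereas a dedicated `torch.Generator` on the training loader decouples the two:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def train_orders(iterate_other_loader, use_private_generator, epochs=3):
    torch.manual_seed(0)  # reset the global RNG so runs are comparable
    other = DataLoader(TensorDataset(torch.arange(4)), batch_size=2, shuffle=True)
    # A private generator makes the training shuffle independent of the global RNG
    gen = torch.Generator().manual_seed(123) if use_private_generator else None
    train = DataLoader(TensorDataset(torch.arange(16)), batch_size=16,
                       shuffle=True, generator=gen)
    orders = []
    for _ in range(epochs):
        (batch,) = next(iter(train))  # one full, shuffled batch per epoch
        orders.append(batch.tolist())
        if iterate_other_loader:
            for _ in other:  # bare loop, like the test_loader loop above
                pass
    return orders
```

With the global RNG, `train_orders(False, False)` and `train_orders(True, False)` agree on the first epoch and then diverge, mirroring the losses in the question. With the private generator the two calls return the same orders. In the code above, the equivalent change would be to pass something like `generator=torch.Generator().manual_seed(args["seed"])` to the training `DataLoader`.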