Model.train() loss and model.eval() loss do not match

I am training a classifier on MNIST with the following training/validation loop:

for epoch in range(epochs):
    t0 = time.time()
    model.train()

    running_loss = 0.
    running_acc = 0.

    # training pass: update the model batch by batch
    for i, (image, label) in enumerate(train_loader):
        optimizer.zero_grad()
        image = image.to(device)
        label = label.long()
        label = label.to(device)
        y = model(image)
        loss = loss_function(y, label)
        loss.backward()
        optimizer.step()

        running_loss += label.shape[0] * loss.item()   # sample-weighted sum of batch losses
        _, prediction = torch.max(y, 1)
        total = label.shape[0]
        correct = (prediction == label).sum().item()
        running_acc += correct/total * 100             # per-batch accuracy, averaged over batches later
        del image, label, y, loss

    print(f"epoch {epoch} | time (sec) : {time.time() - t0:.2f} | t_acc : {(running_acc / len(train_loader)):.2f} | t_loss : {(running_loss / len(train_loader.dataset)):.2f}", end=" | ")

    total = 0
    correct = 0

    # evaluation pass; this would normally iterate over validation_loader,
    # but here it re-runs over train_loader (see the note below)
    with torch.no_grad():
        model.eval()
        running_loss = 0.
        for i, (image, label) in enumerate(train_loader):
            image = image.to(device)
            label = label.long()
            label = label.to(device)
            y = model(image)
            loss = loss_function(y, label)
            running_loss += label.shape[0] * loss.item()
            _, prediction = torch.max(y, 1)
            total += label.shape[0]
            correct += (prediction == label).sum().item()

    print(f"v_acc : {(correct/total * 100):.2f} | v_loss : {(running_loss / len(train_loader.dataset)):2f}")

Note that the second half of the loop (beginning at the torch.no_grad() block) would normally iterate over a validation_loader. However, suspecting that my model was not training properly, I replaced validation_loader with train_loader, so I should be seeing roughly equal values for t_loss vs. v_loss and for t_acc vs. v_acc. I understand that there will be slight discrepancies, since t_loss and t_acc are accumulated batch by batch while the weights are still being updated, and t_acc is a mean of per-batch accuracies (a small sketch of that averaging difference follows the output), whereas v_loss and v_acc are computed over the entire training set at the end of the epoch. Nevertheless, looking at the output below, the discrepancies seem far too large:

epoch 0 | time (sec) : 29.59 | t_acc : 21.94 | t_loss : 2.10 | v_acc : 20.28 | v_loss : 2.236833
epoch 1 | time (sec) : 29.61 | t_acc : 23.26 | t_loss : 2.06 | v_acc : 21.78 | v_loss : 2.374591
epoch 2 | time (sec) : 29.62 | t_acc : 28.54 | t_loss : 1.88 | v_acc : 28.54 | v_loss : 1.955524
epoch 3 | time (sec) : 30.20 | t_acc : 40.07 | t_loss : 1.54 | v_acc : 27.01 | v_loss : 2.046747
epoch 4 | time (sec) : 29.40 | t_acc : 44.82 | t_loss : 1.46 | v_acc : 30.15 | v_loss : 1.866847
epoch 5 | time (sec) : 29.38 | t_acc : 55.17 | t_loss : 1.23 | v_acc : 40.11 | v_loss : 1.924560
epoch 6 | time (sec) : 29.70 | t_acc : 77.41 | t_loss : 0.70 | v_acc : 79.59 | v_loss : 0.626660
epoch 7 | time (sec) : 29.65 | t_acc : 84.70 | t_loss : 0.48 | v_acc : 69.35 | v_loss : 0.946439
epoch 8 | time (sec) : 29.93 | t_acc : 87.11 | t_loss : 0.41 | v_acc : 86.33 | v_loss : 0.430216
epoch 9 | time (sec) : 29.97 | t_acc : 88.47 | t_loss : 0.37 | v_acc : 65.38 | v_loss : 1.158408
epoch 10 | time (sec) : 29.90 | t_acc : 89.18 | t_loss : 0.34 | v_acc : 71.23 | v_loss : 0.953730
epoch 11 | time (sec) : 29.63 | t_acc : 90.37 | t_loss : 0.30 | v_acc : 75.13 | v_loss : 0.789689
epoch 12 | time (sec) : 29.62 | t_acc : 90.97 | t_loss : 0.29 | v_acc : 62.28 | v_loss : 1.321358
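
To be concrete about the averaging difference mentioned above, here is a toy sketch; the batch sizes and correct counts are made up purely for illustration and are not taken from my run.

# A mean of per-batch accuracies only equals the dataset-wide accuracy
# when every batch has the same size.
batch_sizes   = [128, 128, 128, 40]    # hypothetical batch sizes; the last batch is smaller
batch_correct = [100,  90, 110,  20]   # hypothetical correct predictions per batch

# what my training loop reports (t_acc): mean of per-batch accuracies
mean_of_batch_accs = sum(c / n * 100 for c, n in zip(batch_correct, batch_sizes)) / len(batch_sizes)

# what my evaluation loop reports (v_acc): correct predictions over the whole set
dataset_acc = sum(batch_correct) / sum(batch_sizes) * 100

print(f"mean of batch accuracies : {mean_of_batch_accs:.2f}")  # 71.09
print(f"dataset-wide accuracy    : {dataset_acc:.2f}")         # 75.47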

I have a single nn.BatchNorm1d in my fully connected block, and since this is the only layer in my model whose behavior changes between model.train() and model.eval(), I was thinking it might be the culprit. Still, I find it hard to believe that this alone causes discrepancies this large. Why am I experiencing this behavior?
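
For what it's worth, this is the check I am planning to run to isolate the BatchNorm hypothesis. It is only a minimal sketch and reuses model, train_loader, loss_function, and device from above: keep the model in train() mode so BatchNorm uses per-batch statistics instead of its running estimates, but wrap the pass in torch.no_grad() so no parameters are updated. If this loss tracks t_loss much more closely, the BatchNorm running statistics would seem to be the culprit.

import torch

model.train()                      # BatchNorm uses batch statistics, as during training
running_loss, correct, total = 0., 0, 0
with torch.no_grad():              # but no gradient tracking / parameter updates
    for image, label in train_loader:
        image, label = image.to(device), label.long().to(device)
        y = model(image)
        running_loss += label.shape[0] * loss_function(y, label).item()
        correct += (y.argmax(1) == label).sum().item()
        total += label.shape[0]
        # note: BatchNorm running stats are still updated in train() mode,
        # even under no_grad()

print(f"train-mode no_grad loss : {running_loss / total:.2f} | acc : {correct / total * 100:.2f}")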