Evaluate Twice, Accuracy Changes if I Shuffle

I have a sanity test for my model that I run before using it for evaluation, and I do not understand why the accuracy changes every time I evaluate. If I just keep evaluating, the accuracy seems to change with some sort of period, so I'm guessing the shuffle order changes periodically in some way (maybe the index the DataLoader starts at changes, or something like that).

My models are ResNet10 on CIFAR-10.

I’ve tried converting the tensors to doubles in the evaluate function (as shown below) to rule out numerical issues, but to no avail.

I’m fixing all the other sources of randomness I know of. Why would this happen? Why would shuffling the data change the accuracy like that?

def fix_seed(seed):
    torch.manual_seed(seed)  # seed the PyTorch RNG
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


acc = evaluate(model, test_loader)
print("Model 1 Original Accuracy: {}".format(acc))
assert acc == evaluate(model, test_loader)


def evaluate(model, test_loader):

    # TODO, why would the total_correct / total_num change as we shuffled the data differently?
    for _, (images, labels) in enumerate(test_loader):
        total_correct, total_num = 0., 0.

        with torch.no_grad():
            labels = labels.cuda().double()
            img = images.cuda()
            h = model(img)
            preds = h.argmax(dim=1).double()
            total_correct = (preds == labels).sum().cpu().item()
            total_num += h.shape[0]

    return total_correct / total_num

Hello! Just a quick note / question about your snippet:

How do you define acc? The snippet shows where you define acc2 but not acc, and you’re comparing acc rather than acc2 with the re-evaluated model, so I just wanted to confirm that this isn’t somehow causing the issue.

Also, another quick idea: have you tried adding torch.use_deterministic_algorithms(True) to your list of deterministic settings?

Per the documentation, that does more than just torch.backends.cudnn.deterministic = True, even though they sound similar.
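Putting the seeding and the CuDNN flags together, one version of fix_seed might look like this (a sketch, assuming a reasonably recent PyTorch; which calls you actually need depends on your setup):

```python
import random

import numpy as np
import torch


def fix_seed(seed):
    # Seed every RNG that PyTorch code commonly touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds all CUDA devices in recent PyTorch

    # Make CuDNN pick deterministic kernels and skip autotuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Force deterministic implementations everywhere PyTorch supports them
    # (and raise an error where it does not).
    torch.use_deterministic_algorithms(True)
```

Note that with a shuffled DataLoader, re-seeding only makes the shuffle order reproducible across runs; it does not make two successive passes over the loader visit batches in the same order unless you re-seed (or pass a freshly seeded generator) before each pass.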

Ok, thanks for the tip about the deterministic algorithms! Regarding acc: I actually have two models, so I modified the snippet for posting and made a mistake along the way. They should all be acc.

Resolved! If you look at the code inside evaluate, you can see that total_num and total_correct are initialized in the wrong location: inside the batch loop. That means every call returns the accuracy of only the last batch, and shuffling changes which examples land in that batch (which is probably why it looked cyclic: the shuffle order repeats in some sort of cycle). The accuracies we were seeing were single-batch accuracies, not the accuracy over the full test set.
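For anyone landing here later, a corrected evaluate might look like the sketch below. The totals are initialized once before the loop, and total_correct is accumulated with += rather than overwritten with = (a second bug in the original). I've also made it device-agnostic instead of hard-coding .cuda(), and dropped the double conversions, which were never needed for an argmax comparison:

```python
import torch


def evaluate(model, test_loader):
    """Accuracy over the whole test set, independent of batch order."""
    model.eval()
    device = next(model.parameters()).device  # original hard-coded .cuda()
    total_correct, total_num = 0, 0  # initialized ONCE, before the loop
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            total_correct += (preds == labels).sum().item()  # accumulate, don't overwrite
            total_num += images.shape[0]
    return total_correct / total_num
```

Because the totals now cover every batch, the result is a sum over the full test set and is invariant to the DataLoader's shuffle order.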