Test accuracy changes after training

I’ve noticed a strange behaviour of my network that I can’t explain: I trained a network, evaluated it on a separate test set, and got an accuracy score. When I called the evaluation function again, I got a different value.
Some more details:
I use a Jupyter notebook in Google Colab. I load the data as follows:

full_dataset = datasets.ImageFolder(image_path, transform=transform)
train_size = int(train_split * len(full_dataset))
test_size = len(full_dataset) - train_size
train_dataset, test_dataset = random_split(full_dataset, [train_size, test_size])
trainloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
testloader = DataLoader(test_dataset, batch_size=batch_size)
classes = train_dataset.dataset.classes

Then I define a CNN (a pretrained resnet18), train it, and evaluate it using this code (the output is binary):

def evaluate_model(mod):
  correct = 0
  total = 0
  with torch.no_grad():
    for data in testloader:
      imgs, labels = data[0].to(device), data[1].to(device)
      out = mod(imgs)
      preds = (out > 0.5).float()
      l = labels.unsqueeze(1).float()
      total += l.size(0)
      correct += (preds == l).sum().item()
  return ((correct/total)*100)

This function gives me significantly different values if I call it multiple times without changing anything in-between. How is that possible?
I used every option I know to fix the randomness to see if that helps, but the problem persists.

seed = 123
import os
os.environ['PYTHONHASHSEED'] = str(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
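For comparison, a fuller seeding helper might look like the sketch below. The name `seed_everything` is hypothetical and not from the post above, and the NumPy and PyTorch parts are skipped if those packages are not installed. The short demo at the end uses only Python's `random` module to show what seeding does and does not guarantee.

```python
import os
import random


def seed_everything(seed: int = 123) -> None:
    """Seed every RNG this script can reach (hypothetical helper)."""
    # Only affects subprocesses started later: the running interpreter
    # reads PYTHONHASHSEED once, at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)           # CPU generator
    torch.cuda.manual_seed_all(seed)  # all CUDA devices, if any
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(123)
first = random.random()
second = random.random()   # the generator has advanced since `first`
seed_everything(123)
replay = random.random()   # re-seeding replays the sequence from the start
print(first == replay, first == second)
```

Seeding fixes the sequence of random numbers, not each individual draw. Two consecutive passes over anything that consumes random numbers will therefore still differ unless the seed is reset before each pass.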

@Sebastian_E I had the same issue back then, with the exact same seeding function that you used. There might be two problems:

  1. To debug, train for just one epoch and then evaluate. Write down the score you get, restart the notebook, and train again (this only works if your dataset is small enough that one epoch takes no more than a few minutes). Then compare the two scores.

  2. Colab assigns different GPUs between sessions, and different GPUs might yield different results even if you set the seed. Check which GPU you have been given on Colab with:

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
    print('and then re-execute this cell.')
else:
    print(gpu_info)
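The `!nvidia-smi` syntax only works inside a notebook. In a plain Python script, a rough equivalent could look like the sketch below. The helper name `gpu_names` is hypothetical, and the sketch assumes that the NVIDIA driver's `nvidia-smi` tool is what reports the GPU; it returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def gpu_names() -> list:
    """Return the names of the visible NVIDIA GPUs, or [] if none are found."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # tool present but no usable GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print(gpu_names() or "No NVIDIA GPU visible")
```

Logging this value next to each accuracy score makes it easy to see afterwards whether two runs actually used the same hardware.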

Thanks for the answer. There seem to be some problems with Colab and reproducibility in general. However, I downloaded the notebook and ran it for an epoch on my CPU (I don’t have a supported GPU, thanks AMD…), and strangely the problem was still there.
And this weird behaviour doesn’t only occur after a restart of the session. The numbers are different when I execute the evaluation function a second time directly after the first time.

Yes, I faced that issue too. I don’t know the reason, but you cannot run the code a second time and expect the same result without restarting the session (and staying on the same GPU on Colab). For me, all my results are reproducible once I make sure I am on the same GPU and restart the session.

I am stupid :man_facepalming:
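Some of the transformations had randomness in them (e.g. random rotation or crops). That changes the images, of course, and can lead to different scores. Because the same transform was passed to ImageFolder before the split, the test images were randomly augmented on every evaluation pass too.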
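One common way to avoid this is to give each split its own transform instead of passing a single transform to the full dataset. The sketch below is one possible approach, not necessarily what was done here: `TransformedSubset` is a hypothetical wrapper, and plain numbers with two toy functions stand in for images and torchvision transforms so that it runs without any dependencies.

```python
import random


class TransformedSubset:
    """Wrap an indexable dataset of (image, label) pairs, such as a split
    returned by random_split, and apply a split-specific transform."""

    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __len__(self):
        return len(self.subset)

    def __getitem__(self, idx):
        img, label = self.subset[idx]
        if self.transform is not None:
            img = self.transform(img)
        return img, label


# Stand-in data and transforms (assumptions for illustration only).
samples = [(float(i), i % 2) for i in range(6)]
train_part, test_part = samples[:4], samples[4:]


def random_augment(x):   # plays the role of a random rotation or crop
    return x + random.random()


def deterministic(x):    # plays the role of resizing / normalising
    return x / 10.0


train_dataset = TransformedSubset(train_part, random_augment)
test_dataset = TransformedSubset(test_part, deterministic)


def read_all(ds):
    return [ds[i] for i in range(len(ds))]


# Reading the test split twice gives identical samples; the train split does not.
test_is_stable = read_all(test_dataset) == read_all(test_dataset)
train_is_stable = read_all(train_dataset) == read_all(train_dataset)
print(test_is_stable, train_is_stable)
```

With the real data, this would mean creating the ImageFolder without a transform, splitting it with random_split, and wrapping the two splits with a training transform (random augmentation allowed) and a test transform (deterministic steps only). The wrapper has no `.dataset` attribute, so the class names would have to be read from `full_dataset.classes` instead.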