cuDNN causing inconsistent test results depending on batch_size

I have run into an odd problem. I have a trained model with many Conv1d layers. When I put the model in eval() mode and run the test set through it, I get different accuracies depending on the batch_size.

For example, when I run the same test data through my model all at once vs one at a time, my predictions are wildly different.

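import numpy as np
import torch

# model, xtest and device are defined earlier (not shown).
model.eval()  # same effect as model.train(False)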

batch_size = len(xtest)
batch_guesses = np.array([])
print("BATCH SIZE ", batch_size)
for i in range(0, len(xtest), batch_size):
    inputs = torch.Tensor(xtest[i : i + batch_size]).to(device)
    output = model(inputs)
    prediction = torch.argmax(output, dim=1)
    prediction = prediction.detach().cpu().numpy()
    batch_guesses = np.append(batch_guesses, prediction)

print("INDIVIDUAL")
batch_size = 1
single_guesses = np.array([])
for i in range(0, len(xtest), batch_size):
    inputs = torch.Tensor(xtest[i : i + batch_size]).to(device)
    output = model(inputs)
    prediction = torch.argmax(output, dim=1)
    prediction = prediction.detach().cpu().numpy()
    single_guesses = np.append(single_guesses, prediction)


# Fraction of predictions that agree between single_guesses and batch_guesses
print(np.sum(single_guesses == batch_guesses) / len(single_guesses))
# 1.0      with torch.backends.cudnn.enabled = False
# 0.790625 with torch.backends.cudnn.enabled = True

However, I found that this only happens when torch.backends.cudnn.enabled = True. Setting it to False makes single_guesses and batch_guesses identical regardless of batch size. I also found that the larger the gap between the two batch sizes, the more predictions differ: batch sizes 1 and 64 agree on 92% of predictions, while batch sizes 1 and 2056 agree on only 79%.

Why does cuDNN have such a large effect on classification depending on the batch size?

It seems you are comparing values for exact equality, which is unreliable with floating-point math.
Small numerical mismatches are expected, so you should use a small tolerance when comparing the results of different algorithms against each other. Here is a small example on the CPU that does not use cuDNN at all:

x = torch.randn(10, 10, 10)
s1 = x.sum()
s2 = x.sum(0).sum(0).sum(0)
print((s1 - s2).abs())
# tensor(7.6294e-06)
print(s1 == s2)
# tensor(False)
print(torch.allclose(s1, s2))
# True

Thank you for your prompt response!

In the example I’m comparing the argmax of the output, not the logits themselves. So it’s a comparison of integer values where == should be fine.
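One thing that may be worth checking here, offered as a sketch rather than a diagnosis: argmax returns integers, but it is computed from floating-point logits, so a sample whose top two logits are nearly tied can change class under a very small numerical difference. The logits and the perturbation below are made up for illustration and are not taken from the model in this thread.

```python
import torch

# Hypothetical logits for three samples and three classes.
# Sample 1 has a near-tie between class 0 and class 1 (margin of about 1e-6).
logits = torch.tensor([
    [2.0, 1.0, 0.5],
    [0.5, 0.5 + 1e-6, 0.1],
    [0.1, 0.2, 3.0],
])

# Simulate a tiny numerical discrepancy (assumed size 2e-6) on one logit.
noise = torch.zeros_like(logits)
noise[1, 0] = 2e-6
perturbed = logits + noise

pred_a = logits.argmax(dim=1)
pred_b = perturbed.argmax(dim=1)

# Margin between the largest and second-largest logit of each sample.
top2 = logits.topk(2, dim=1).values
margin = top2[:, 0] - top2[:, 1]

print(torch.allclose(logits, perturbed, atol=1e-5))  # True: logits match within tolerance
print(pred_a.tolist(), pred_b.tolist())              # [0, 1, 2] [0, 0, 2]
print((margin < 1e-5).tolist())                      # [False, True, False]
```

If the disagreeing samples in the real test set all have a tiny top-2 margin, the mismatch is numerical noise; if they disagree with large margins, something else is going on.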

Ah, I see. In this case, could you post a minimal and executable code snippet to reproduce the issue as well as the output of python -m torch.utils.collect_env, please?
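A reproduction along those lines might look roughly like the sketch below. The small Conv1d network and the random xtest are hypothetical stand-ins, since the real model and data are not shown in this thread, and the helper names run and compare are made up for the example. It compares the logits as well as the argmax, with cuDNN switched on and off.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for the trained model (real architecture unknown).
model = nn.Sequential(
    nn.Conv1d(4, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(8, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(8, 5),
).to(device).eval()

# Random stand-in for xtest: 64 samples, 4 channels, length 32.
xtest = torch.randn(64, 4, 32, device=device)


def run(x, batch_size):
    """Return logits for x, computed in chunks of batch_size without autograd."""
    with torch.no_grad():
        return torch.cat([model(chunk) for chunk in x.split(batch_size)])


def compare(x, cudnn_enabled):
    """Compare full-batch and one-at-a-time outputs with cuDNN on or off."""
    previous = torch.backends.cudnn.enabled
    torch.backends.cudnn.enabled = cudnn_enabled
    try:
        full = run(x, len(x))
        single = run(x, 1)
    finally:
        torch.backends.cudnn.enabled = previous  # restore the global flag
    max_diff = (full - single).abs().max().item()
    agreement = (full.argmax(1) == single.argmax(1)).float().mean().item()
    return max_diff, agreement


for flag in (True, False):
    max_diff, agreement = compare(xtest, flag)
    print(f"cudnn.enabled={flag}: max |logit diff|={max_diff:.3e}, "
          f"argmax agreement={agreement:.4f}")
```

On a machine without a GPU the cuDNN flag has no effect, so both lines should report essentially the same numbers. Posting the printed output together with the result of python -m torch.utils.collect_env would show whether the logit differences are at the level of rounding error or much larger.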