PyTorch matmul is inconsistent on GPU

import torch

xs = torch.randn(4, 18, 2, device='cuda')
ys = torch.randn(2, 2, device='cuda')

# Three ways of computing the same entry of xs @ ys
print((xs @ ys)[0, 0, 0])
print((xs[0, 0].unsqueeze(0) @ ys)[0, 0])
print(torch.matmul(xs, ys)[0, 0, 0])

The first and second outputs can differ slightly (by up to 1e-4). The GPU is an RTX 3090 and the PyTorch version is 1.7.1; the inconsistency has not been observed on CPU.
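
For example, continuing from the snippet above, one way to quantify the discrepancy for that row:

# Maximum absolute difference between the batched and single-row results
print(((xs @ ys)[0, 0] - (xs[0, 0].unsqueeze(0) @ ys)[0]).abs().max())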

It’d be best to check with a dev, but it’s possibly because 30-series cards default to using TensorFloat32 (TF32) for matmuls, whereas your CPU defaults to Float32. (See here and here for more detail.)

TensorFloat32 has the same range as Float32 but the precision of Float16, so you’re probably seeing round-off error on the order of 1e-4. You can disable this behaviour by setting torch.backends.cuda.matmul.allow_tf32 and torch.backends.cudnn.allow_tf32 to False. More detail is here: CUDA semantics — PyTorch 1.10.0 documentation
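
For example, to disable TF32 before rerunning the comparison (both flags are documented on the CUDA semantics page linked above):

import torch

# Run matmuls in full Float32 instead of TF32 on Ampere GPUs
torch.backends.cuda.matmul.allow_tf32 = False
# The same switch for cuDNN operations (e.g. convolutions)
torch.backends.cudnn.allow_tf32 = False

xs = torch.randn(4, 18, 2, device='cuda')
ys = torch.randn(2, 2, device='cuda')

print((xs @ ys)[0, 0, 0])
print((xs[0, 0].unsqueeze(0) @ ys)[0, 0])
print(torch.matmul(xs, ys)[0, 0, 0])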

Here’s my output for your code:

tensor(0.2188241482, device='cuda:0')
tensor(0.2188241482, device='cuda:0')
tensor(0.2188241482, device='cuda:0')

3080ti, torch 1.10.0

That’s weird; here are some of my outputs:

tensor(-0.9371, device='cuda:0')
tensor(-0.9373, device='cuda:0')
tensor(-0.9371, device='cuda:0')

tensor(-0.1670, device='cuda:0')
tensor(-0.1665, device='cuda:0')
tensor(-0.1670, device='cuda:0')

tensor(0.5239, device='cuda:0')
tensor(0.5238, device='cuda:0')
tensor(0.5239, device='cuda:0')

You could try forcing deterministic mode and see if your results are the same?
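
For example (torch.use_deterministic_algorithms is available from PyTorch 1.8 onwards; on 1.7.1 the rough equivalent is the older torch.set_deterministic, and deterministic cuBLAS matmuls also need the CUBLAS_WORKSPACE_CONFIG environment variable on CUDA 10.2+):

import os

# Required for deterministic cuBLAS matmuls on CUDA 10.2 and later;
# must be set before the CUDA context is created
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

# Raise an error if any operation would run nondeterministically
torch.use_deterministic_algorithms(True)

xs = torch.randn(4, 18, 2, device='cuda')
ys = torch.randn(2, 2, device='cuda')

print((xs @ ys)[0, 0, 0])
print(torch.matmul(xs, ys)[0, 0, 0])

Note that this addresses run-to-run nondeterminism; the TF32 round-off discussed above is a separate effect controlled by the allow_tf32 flags.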