The first and second outputs can differ slightly (by up to 1e-4);
the device is an RTX 3090 and the PyTorch version is 1.7.1;
the inconsistency has not been observed on CPU.
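A minimal sketch of the kind of comparison being described, assuming the difference shows up between two forward passes over the same input (`x`, `w`, and the matmul here are placeholders, not the original model):

```python
import torch

# Hypothetical repro sketch: run the same computation twice and compare.
# On a 30-series GPU, the two results may differ by up to ~1e-4;
# on CPU the same two runs match exactly.
x = torch.randn(256, 256)
w = torch.randn(256, 256)

out1 = x @ w
out2 = x @ w

# Check whether the outputs agree within the reported 1e-4 tolerance.
print(torch.allclose(out1, out2, atol=1e-4))
```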
It’d be best to check with a dev, but it’s possibly because 30-series cards default to TensorFloat32 for matrix operations, whereas your CPU defaults to Float32. (See here and here for more detail.)
TensorFloat32 has the same range as Float32 but the precision of Float16, so you’re probably seeing round-off error on the order of 1e-4. You can disable this behaviour by setting torch.backends.cuda.matmul.allow_tf32 and torch.backends.cudnn.allow_tf32 to False. More detail is here: CUDA semantics — PyTorch 1.10.0 documentation
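For reference, disabling TF32 looks like this (these are the real flags from torch.backends; set them before running the model):

```python
import torch

# Force full Float32 precision on Ampere GPUs (e.g. a 3090) by turning
# off the TF32 fast paths for matmul and cuDNN convolution kernels.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

print(torch.backends.cuda.matmul.allow_tf32, torch.backends.cudnn.allow_tf32)
```

With both flags False, GPU matmul/conv results should agree with the CPU to normal Float32 tolerance, at some cost in throughput on Ampere hardware.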