FP32 with TF32 precision

No, direct rounding won’t match your A4000, since e.g. the accumulations are still performed in FP32 as described here. Also, only cuDNN convolutions use TF32 by default; in newer PyTorch releases matmuls only use TF32 if you opt in via torch.backends.cuda.matmul.allow_tf32 = True. Layers implemented with native PyTorch kernels (i.e. not dispatching to cuDNN or cuBLAS) will not use TF32.
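
A minimal sketch of the relevant flags, assuming an Ampere-or-newer GPU is available (the tensor sizes and the comparison are just illustrative):

```python
import torch

# cuDNN convolutions use TF32 by default on Ampere+ GPUs
print(torch.backends.cudnn.allow_tf32)        # True by default

# matmuls need an explicit opt-in in newer releases
torch.backends.cuda.matmul.allow_tf32 = True

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

out_tf32 = a @ b                              # runs with TF32 tensor cores

torch.backends.cuda.matmul.allow_tf32 = False
out_fp32 = a @ b                              # full FP32 matmul

# small but non-zero difference from the reduced TF32 mantissa
print((out_tf32 - out_fp32).abs().max())
```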