CPU/GPU results inconsistent with matrix multiplication

Hi Zvant!

Depending on your GPU, Nvidia might be switching you over by
default to the misleadingly (dishonestly?) named “tf32” floating-point
arithmetic. (tf32 keeps fp32’s exponent range but truncates the
mantissa to 10 bits, so its precision is essentially that of
half-precision floating-point.)

You can try turning tf32 off with:

torch.backends.cuda.matmul.allow_tf32 = False
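
Here is a minimal sketch of how you might check this yourself. It assumes
you have a CUDA-capable GPU recent enough (Ampere or newer) to use tf32,
and it compares the CPU matmul result against GPU matmuls with tf32
enabled and disabled:

import torch

torch.manual_seed (0)
a = torch.randn (1024, 1024)
b = torch.randn (1024, 1024)

cpu_result = a @ b   # reference fp32 result on the CPU

a_gpu, b_gpu = a.cuda(), b.cuda()

torch.backends.cuda.matmul.allow_tf32 = True    # tf32 matmul (the default on some versions / GPUs)
tf32_result = (a_gpu @ b_gpu).cpu()

torch.backends.cuda.matmul.allow_tf32 = False   # force true fp32 matmul on the GPU
fp32_result = (a_gpu @ b_gpu).cpu()

print ('tf32 max abs diff:', (tf32_result - cpu_result).abs().max())
print ('fp32 max abs diff:', (fp32_result - cpu_result).abs().max())

If tf32 is the culprit, the first difference should be noticeably larger
than the second.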

See the following thread and the GitHub issue @tom references in it:

Best.

K. Frank