Huge performance decrease when matrix multiplication goes from float32 to float64

I noticed that matrix multiplication with torch.float32 is about 40 times faster than with torch.float64.

import torch
torch.set_default_device('cuda:0')  # cuda:0 is an A6000

def timer_torch(x, y):
    # run the matmul and block until the GPU has finished,
    # so %timeit measures the full execution time
    z = x @ y
    torch.cuda.synchronize()
    return

float32

x = torch.randn(100,1000000)
y = torch.randn(1000000,100)

%timeit timer_torch(x, y)
1.68 ms ± 4.51 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

float64

x = torch.randn(100,1000000).to(torch.float64)
y = torch.randn(1000000,100).to(torch.float64)

%timeit timer_torch(x, y)
68 ms ± 325 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)

So on my A6000 with torch 2.1 and CUDA 12.1, float32 is about 40 times faster than float64.
Is this the expected behavior? I want to make sure this is not a result of some mismatch in libraries that would produce such dramatic difference.
If the result is expected, are there strategies to still leverage the GPU for large matrices where 64-bit precision is necessary?
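
As an aside on the measurement itself, torch.utils.benchmark can replace the hand-rolled timer; it handles warmup and CUDA synchronization on its own. A minimal sketch, assuming the same A6000 setup and tensor shapes as above:

import torch
import torch.utils.benchmark as benchmark

torch.set_default_device('cuda:0')

x = torch.randn(100, 1_000_000)
y = torch.randn(1_000_000, 100)

# Timer takes care of the torch.cuda.synchronize() calls and warms up the kernels.
t32 = benchmark.Timer(stmt='x @ y', globals={'x': x, 'y': y})
print(t32.blocked_autorange())

# Same measurement with float64 operands.
t64 = benchmark.Timer(stmt='x @ y', globals={'x': x.double(), 'y': y.double()})
print(t64.blocked_autorange())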

The result seems expected: the spec sheet lists a peak FP32 performance of ~38.7 TFLOPS for the A6000 versus a peak FP64 performance of ~1.25 TFLOPS.
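
A quick back-of-the-envelope check, using only the spec-sheet peaks quoted above and the two measured timings:

fp32_peak, fp64_peak = 38.7e12, 1.25e12   # peak FLOP/s from the spec sheet
print(fp32_peak / fp64_peak)              # ~31x peak throughput ratio

flops = 2 * 100 * 1_000_000 * 100         # 2*m*k*n FLOPs for a (100, 1e6) @ (1e6, 100) matmul
print(flops / 1.68e-3 / 1e12)             # ~11.9 TFLOP/s achieved in the float32 run
print(flops / 68e-3 / 1e12)               # ~0.3 TFLOP/s achieved in the float64 run

The ~31x peak ratio is the same order of magnitude as the measured ~40x gap, so the slowdown is roughly in line with the hardware's FP64 throughput rather than a misconfiguration.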

Thank you very much for the reply - now I don't have to worry about debugging my configuration looking for library/version inconsistencies. Do you have any suggestions for the second part of my question: if the result is expected, are there strategies to still leverage the GPU for large matrices where 64-bit precision is necessary?

Assuming you really need float64 precision, you could look at data center devices such as the A100 or H100, which improve FP64 performance significantly.
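
If switching hardware is not an option, one software-side idea (a rough sketch, not a drop-in float64 replacement) is to emulate extra precision with float32 GEMMs: split each float64 operand into a float32 high part plus a float32 remainder, and accumulate the partial products in float64 over chunks of the inner dimension to limit float32 accumulation error. Whether the resulting accuracy is sufficient depends entirely on the data, so it needs to be validated against a float64 reference:

import torch

torch.set_default_device('cuda:0')
# make sure the float32 GEMMs are not silently run in TF32 on Ampere
torch.backends.cuda.matmul.allow_tf32 = False

def matmul_mixed(x, y, chunk=65_536):
    # Approximate a float64 matmul with float32 GEMMs:
    # each operand is split as x ~ x_hi + x_lo (both float32), the cross
    # products are computed in float32, and the partial results are summed
    # in float64 chunk by chunk to limit float32 accumulation error.
    # This is NOT equivalent to a true float64 matmul.
    m, k = x.shape
    z = torch.zeros(m, y.shape[1], dtype=torch.float64)
    for start in range(0, k, chunk):
        xs = x[:, start:start + chunk]
        ys = y[start:start + chunk, :]
        x_hi = xs.float()
        x_lo = (xs - x_hi.double()).float()
        y_hi = ys.float()
        y_lo = (ys - y_hi.double()).float()
        # the x_lo @ y_lo term is dropped; it is far below float32 resolution
        z += (x_hi @ y_hi).double() + (x_hi @ y_lo).double() + (x_lo @ y_hi).double()
    return z

x = torch.randn(100, 1_000_000, dtype=torch.float64)
y = torch.randn(1_000_000, 100, dtype=torch.float64)

z_ref = x @ y                 # slow float64 reference
z_mix = matmul_mixed(x, y)    # float32 GEMMs with float64 accumulation
print(((z_mix - z_ref).abs().max() / z_ref.abs().max()).item())

The chunked float64 accumulation keeps the per-chunk float32 rounding error bounded, at the cost of three float32 matmuls (plus some casts) per chunk, which on this hardware should still be much cheaper than the native FP64 path.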