TensorFloat-32 MatMul speed-up issue on 3090

I am trying to run TensorFloat-32 operations on a 3090 to test the speed-up using the code snippet below. But with TF32 enabled, the execution time is higher than with TF32 disabled: the first matmul takes 5.173683166503906e-05 s with TF32 disabled, whereas with TF32 enabled it takes 0.0006937980651855469 s. Am I doing anything wrong?

System spec:
GPU - RTX 3090 FE
torch - 1.11.0+cu113

import torch

a_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
b_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
ab_full = a_full @ b_full
mean = ab_full.abs().mean()  

a = a_full.float()
b = b_full.float()

# Do matmul in TF32 mode (allow_tf32 defaults to True in torch 1.11 on Ampere).
torch.backends.cuda.matmul.allow_tf32 = True
ab_tf32 = a @ b
error = (ab_tf32 - ab_full).abs().max()  
relative_error = error / mean 

# Do matmul with TF32 disabled.
torch.backends.cuda.matmul.allow_tf32 = False
ab_fp32 = a @ b 
error = (ab_fp32 - ab_full).abs().max() 
relative_error = error / mean 
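The snippet above measures accuracy, not speed. For timing, CUDA kernels launch asynchronously, so wall-clock measurements are only meaningful if you synchronize the GPU before reading the clock and average over several iterations after a warm-up call. A minimal sketch (the helper name `timed_matmul` is mine, and it falls back to CPU so it runs anywhere):

```python
import time
import torch

def timed_matmul(a, b, n_iters=10):
    """Average the wall-clock time of a matmul, syncing the GPU so we
    measure kernel execution rather than just the async launch."""
    if a.is_cuda:
        torch.cuda.synchronize()  # drain any previously queued work
    start = time.perf_counter()
    for _ in range(n_iters):
        c = a @ b
    if a.is_cuda:
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    return (time.perf_counter() - start) / n_iters, c

device = 'cuda' if torch.cuda.is_available() else 'cpu'
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
timed_matmul(a, b)  # warm-up: the first call pays one-time init/tuning costs
t, c = timed_matmul(a, b)
print(f"avg matmul time: {t:.6f} s")
```

The first matmul in a process is especially misleading, since it includes cuBLAS initialization; that alone could explain the numbers you posted.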

Also, one more follow-up question: is there any way to detect Ampere vs. non-Ampere architecture in PyTorch, or anything similar (not talking about torch.cuda.get_device_name())?

How do you measure that?
I'm asking because it is easy to forget to synchronize the GPU before taking the start and end timestamps when timing manually. With the autograd profiler, the measurements should be OK by default.
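For reference, a minimal profiler-based measurement might look like this (on a GPU you would pass `use_cuda=True` to also record CUDA kernel times; this sketch runs on CPU so it works anywhere):

```python
import torch

a = torch.randn(512, 512)
b = torch.randn(512, 512)

# Record op-level timings; add use_cuda=True when profiling a CUDA device.
with torch.autograd.profiler.profile() as prof:
    c = a @ b

# Aggregate per-op stats into a readable table.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```

The table lists each dispatched op (e.g. the underlying `aten::mm`) with its total and average time, which avoids the manual-synchronization pitfall entirely.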

Best regards