I am trying to run TensorFloat-32 (TF32) operations on an RTX 3090 to test the speed-up, using the code snippet below. But with TF32 enabled, the execution time is higher than with TF32 disabled: the first matmul takes 5.173683166503906e-05 s with TF32 disabled, while with TF32 enabled it takes 0.0006937980651855469 s. Am I doing anything wrong?
System spec:
GPU: RTX 3090 FE
torch: 1.11.0+cu113
CUDA: 11.3
import torch

# Build a float64 reference result.
a_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
b_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
ab_full = a_full @ b_full
mean = ab_full.abs().mean()  # scale for the relative error

a = a_full.float()
b = b_full.float()

# Do matmul in TF32 mode (on by default in torch 1.11; set explicitly here).
torch.backends.cuda.matmul.allow_tf32 = True
ab_tf32 = a @ b
error = (ab_tf32 - ab_full).abs().max()
relative_error = error / mean

# Do matmul with TF32 disabled.
torch.backends.cuda.matmul.allow_tf32 = False
ab_fp32 = a @ b
error = (ab_fp32 - ab_full).abs().max()
relative_error = error / mean
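A note on how I would measure this: since CUDA kernels launch asynchronously, a wall-clock measurement needs a torch.cuda.synchronize() before and after the matmul, otherwise it mostly captures the kernel launch. A minimal timing sketch (the timed_matmul helper is my own, not a PyTorch API):

import time
import torch

def timed_matmul(a, b):
    # Drain any pending GPU work so it isn't attributed to this matmul.
    torch.cuda.synchronize()
    start = time.time()
    out = a @ b
    # Block until the kernel finishes; without this, time.time() only
    # measures the asynchronous launch, not the matmul itself.
    torch.cuda.synchronize()
    return out, time.time() - start

# Warm up first: the very first matmul can include one-time cuBLAS
# initialization, which would inflate the measurement.
_ = timed_matmul(a, b)
ab, elapsed = timed_matmul(a, b)
print(f'matmul took {elapsed:.6f} s')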
One more follow-up question: is there any way to detect Ampere vs. non-Ampere architecture in PyTorch, or anything similar? (I am not talking about torch.cuda.get_device_name().)
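For context, the closest I am aware of is checking the CUDA compute capability, where major version 8 corresponds to Ampere (a heuristic sketch, not an official architecture query):

import torch

# Compute capability as a (major, minor) tuple; an RTX 3090 reports (8, 6).
major, minor = torch.cuda.get_device_capability()
is_ampere = major == 8  # Ampere GPUs are sm_80 / sm_86 / sm_87
print(f'compute capability {major}.{minor}, Ampere: {is_ampere}')

Is there anything more direct than this?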