It looks like you didn’t enable TF32 for cuBLAS operations, as previously mentioned.
With pure FP32 I get roughly GPU: 5977.826972784215 iters/s, 0.00016728486865758897 s/iter, because this kernel is used:
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
    100.0        161114759       1010  159519.6  159743.0    156352    160383        835.4  ampere_sgemm_128x64_nn
      0.0            11040          1   11040.0   11040.0     11040     11040          0.0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
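
For reference, here is a minimal sketch of how the TF32 path for cuBLAS matmuls can be switched on and compared against pure FP32. The flag `torch.backends.cuda.matmul.allow_tf32` is the real PyTorch switch; the helper name `bench_matmul`, the matrix size, and the iteration count are assumptions for illustration and are not taken from the original benchmark script.

```python
import time

import torch


def bench_matmul(allow_tf32, n=1024, iters=10):
    """Return iterations per second for an n x n FP32 matmul (hypothetical helper)."""
    # Real PyTorch flag: allows cuBLAS to use TF32 tensor cores for FP32 matmuls.
    # It only has an effect on GPUs that support TF32 (Ampere or newer).
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32

    # Fall back to the CPU so the sketch still runs without a GPU;
    # the flag makes no difference there.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)

    torch.matmul(a, b)  # warm-up, so one-time setup cost is not timed
    if device == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    if device == "cuda":
        # Kernels launch asynchronously, so wait before stopping the timer.
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return iters / elapsed


if __name__ == "__main__":
    for flag in (False, True):
        print(f"allow_tf32={flag}: {bench_matmul(flag):.1f} iters/s")
```

If TF32 is actually in use, the profile should no longer be dominated by `ampere_sgemm_128x64_nn`; re-running the profiler is the easiest way to confirm which kernel was picked.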