Slow A100, cudnn problem?

I guess you didn’t enable TF32 for cuBLAS operations, as previously mentioned.
With pure FP32 I get ~5977.83 iters/s (~0.000167 s/iter) on the GPU, since this FP32 sgemm kernel is used:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
    100.0        161114759       1010  159519.6  159743.0    156352    160383        835.4  ampere_sgemm_128x64_nn                                                                              
      0.0            11040          1   11040.0   11040.0     11040     11040          0.0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
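
For reference, here is a minimal sketch of how TF32 could be enabled for cuBLAS matmuls before rerunning the benchmark. The helper name `matmul_iters_per_sec` and the matrix size and iteration count are hypothetical and are not the workload from this thread; the two `allow_tf32` flags are the PyTorch backend settings, assuming a PyTorch version that exposes them.

```python
import time

import torch


def matmul_iters_per_sec(n=4096, iters=100, device="cuda"):
    """Time n x n FP32 matmuls and return iterations per second.

    Hypothetical helper: n and iters are illustrative defaults only.
    """
    dev = torch.device(device)
    a = torch.randn(n, n, device=dev)
    b = torch.randn(n, n, device=dev)

    # Warm up so kernel selection and lazy initialization are not timed.
    for _ in range(10):
        torch.matmul(a, b)
    if dev.type == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    if dev.type == "cuda":
        # CUDA kernels launch asynchronously; wait before stopping the clock.
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)


# Allow TF32 in cuBLAS matmuls (this flag is off by default in recent releases).
torch.backends.cuda.matmul.allow_tf32 = True
# cuDNN convolutions are controlled by a separate flag.
torch.backends.cudnn.allow_tf32 = True

if torch.cuda.is_available():
    rate = matmul_iters_per_sec()
    print(f"GPU: {rate:.2f} iters/s, {1.0 / rate:.6f} s/iter")
```

With the matmul flag set on an Ampere GPU, a profile of the same run would be expected to show a TF32 tensor core GEMM kernel in place of `ampere_sgemm_128x64_nn`, though the exact kernel chosen depends on the shapes and the cuBLAS version.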