Slow A100, cudnn problem?

Hi, I’ve just switched to a cluster with an A100 GPU, but I’m seeing worse performance than on the previous card I was using (a V100). From other discussions, I believe it could be a cuDNN-version-related issue.
I’m working with a PyTorch installation in a conda environment with the following specifications:

torch.__version__ = '1.13.1'
torch.cuda.get_device_name = 'NVIDIA A100 80GB PCIe'
torch.version.cuda = '11.6'
torch.backends.cudnn.version() = 8302

I’ve read online that the CUDA version should be 11.x, so there should not be any problem, since the one installed is 11.6.
Are there any recommended cudnn versions (or torch versions) for working with an A100? Can the problem be solved via a conda install?

If you are using the default float32 for your model training, you might consider enabling TF32 for cuBLAS operations via torch.backends.cuda.matmul.allow_tf32 = True and check if this gives you the desired speedup.
Also, I would recommend updating to the latest PyTorch release with the latest CUDA runtime.
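
For reference, a minimal way to set this at the start of the script (the cuDNN flag is an extra note for convolutions and is already True by default):

import torch

# enable TF32 for matmuls routed through cuBLAS (disabled by default in recent PyTorch releases)
torch.backends.cuda.matmul.allow_tf32 = True
# TF32 for cuDNN convolutions; already enabled by default, shown here for completeness
torch.backends.cudnn.allow_tf32 = True

Note that TF32 only takes effect on Ampere (e.g. A100) or newer GPUs.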

Sorry for the late reply, but I didn’t have access to the cluster this week.
After running some tests, even with the flag you recommended enabled, I get the following times for multiplying two 1000x1000 matrices:

cpu time 0.00585627555847168
gpu time 0.8221950531005859

I have also reinstalled the latest PyTorch version and installed cudatoolkit-dev, since I saw that the nvcc command was missing, but nothing changed.

I cannot reproduce a slowdown and get:

GPU: 23456.076499351133iters/s, 4.263287596404552e-05s/iter
CPU: 544.8127996382856iters/s, 0.0018354928530752658s/iter

on an A100-SXM4-40GB vs. AMD EPYC 7742 using:

import torch
import time

x = torch.rand(1000, 1000, device="cuda")

# warmup
for _ in range(10):
    y = torch.matmul(x, x)

nb_iters = 1000
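# make sure the warmup kernels have finished before starting the timer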
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(nb_iters):
    y = torch.matmul(x, x)
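# block until all queued matmuls have finished so the elapsed time covers the kernels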
torch.cuda.synchronize()
t1 = time.perf_counter()
print("GPU: {}iters/s, {}s/iter".format(nb_iters/(t1 - t0), (t1 - t0)/nb_iters))

x = torch.randn(1000, 1000)
# warmup
for _ in range(10):
    y = torch.matmul(x, x)

t0 = time.perf_counter()
for _ in range(nb_iters):
    y = torch.matmul(x, x)
t1 = time.perf_counter()
print("CPU: {}iters/s, {}s/iter".format(nb_iters/(t1 - t0), (t1 - t0)/nb_iters))

EDIT:
The nsys profile also shows the same runtime as reported by my manual profiling approach:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
    100.0         41990974       1010   41575.2   41568.0     41248     42112        113.6  void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_128x128_32x5_nn_align4>(T1::Params)              
      0.0            10752          1   10752.0   10752.0     10752     10752          0.0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
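
If nsys is not available on the cluster, torch.profiler reports the same kernel names; a minimal sketch (not the run used for the table above):

import torch
from torch.profiler import profile, ProfilerActivity

x = torch.rand(1000, 1000, device="cuda")

# warmup so the one-time setup is not attributed to the profiled region
for _ in range(10):
    y = torch.matmul(x, x)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        y = torch.matmul(x, x)
    torch.cuda.synchronize()

# lists the executed CUDA kernels, e.g. the cutlass TF32 kernel or ampere_sgemm
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))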

OK, I just tried it and got something better, but not as good as your numbers:

GPU: 8068.225570173726iters/s, 0.0001239429898560047s/iter
CPU: 1316.47006870968iters/s, 0.0007596070915460586s/iter

(A100-80GB)

I guess you didn’t enable TF32 for cuBLAS operations as previously mentioned.
With pure FP32 I get ~GPU: 5977.826999iters/s, 0.000167284s/iter, as this kernel is used instead:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
    100.0        161114759       1010  159519.6  159743.0    156352    160383        835.4  ampere_sgemm_128x64_nn                                                                              
      0.0            11040          1   11040.0   11040.0     11040     11040          0.0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
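
For completeness (not mentioned above): newer PyTorch releases also expose this switch via torch.set_float32_matmul_precision, which selects between the two kernels shown in the tables:

import torch

# "highest" keeps pure FP32 GEMMs (ampere_sgemm above),
# "high" or "medium" allow TF32 on Ampere+ (the cutlass tensorop kernel above)
torch.set_float32_matmul_precision("high")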