Hi, I’ve just switched to a cluster with an A100 GPU, but I’m seeing worse performance than on the previous card I was using (a V100). From looking at other discussions, I believe it could be a cuDNN version issue.
I’m working with an installation of pytorch in a conda environment, with the following specifications:
I’ve read online that the CUDA version should be 11.x, so there should not be any problem, since the one installed is 11.6.
Are there any recommended cuDNN versions (or torch versions) for working with an A100? Can the problem be solved via a conda install?
In case you are using the default float32 for your model training, you might consider enabling TF32 for cuBLAS operations via torch.backends.cuda.matmul.allow_tf32 = True and checking whether this gives you the desired speedup.
Also, I would recommend updating to the latest PyTorch release with the latest CUDA runtime.
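For reference, a minimal sketch of setting these flags globally (assuming a recent PyTorch release; the cuDNN flag is True by default and is shown only for completeness):

```python
import torch

# Allow TF32 for cuBLAS matmuls on Ampere GPUs (disabled by default in recent releases)
torch.backends.cuda.matmul.allow_tf32 = True

# Allow TF32 for cuDNN convolutions (enabled by default; shown for completeness)
torch.backends.cudnn.allow_tf32 = True
```

These flags must be set before the operations you want to accelerate are executed; they apply process-wide.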
Sorry for the late reply, but I didn’t have access to the cluster this week.
After running some tests, and despite using the command you recommended, I get the following times for multiplying two 1000x1000 matrices:
cpu time 0.00585627555847168
gpu time 0.8221950531005859
I have also reinstalled the newest version and installed cudatoolkit-dev, because I saw that the nvcc command was missing, but that didn’t help.
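For what it’s worth, a GPU timing that large for a single 1000x1000 matmul usually means the measurement includes one-off CUDA context initialization, or that the clock was stopped before the asynchronously launched kernel finished. A hedged benchmarking sketch with warmup and explicit synchronization (the iteration counts and device selection here are my assumptions, not taken from the thread):

```python
import time
import torch

def bench_matmul(n=1000, iters=100,
                 device="cuda" if torch.cuda.is_available() else "cpu"):
    """Time an n x n matmul, excluding warmup and honoring async launches."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    for _ in range(10):              # warmup: absorbs CUDA init and cuBLAS setup
        torch.mm(a, b)
    if device == "cuda":
        torch.cuda.synchronize()     # finish warmup kernels before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.mm(a, b)
    if device == "cuda":
        torch.cuda.synchronize()     # kernels launch asynchronously; sync before stopping
    return (time.perf_counter() - t0) / iters

print(f"{bench_matmul():.6f} s/iter")
```

Without the synchronize calls, the measured time reflects only kernel launch overhead (or, for the first call, the full context setup), not the actual compute.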
I guess you didn’t enable TF32 for cuBLAS operations as previously mentioned.
With pure FP32 I get ~5977.83 iters/s (~1.67e-4 s/iter) on the GPU, as this kernel is used:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
100.0 161114759 1010 159519.6 159743.0 156352 160383 835.4 ampere_sgemm_128x64_nn
0.0 11040 1 11040.0 11040.0 11040 11040 0.0 void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
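As an aside, if nsys is not available on the cluster, kernel-level timings can also be inspected from within PyTorch via torch.profiler; a rough sketch (the exact table formatting and kernel names may differ across releases and GPUs):

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1000, 1000, device=device)
b = torch.randn(1000, 1000, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

# Profile a single matmul; on an A100 the CUDA rows should show the
# selected GEMM kernel (e.g. an ampere_sgemm_* variant for pure FP32)
with profile(activities=activities) as prof:
    torch.mm(a, b)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

This makes it easy to check whether enabling TF32 actually changed the kernel being dispatched.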