Performance issues with PyTorch conda package

I have benchmarked the performance of PyTorch for a network using 3D convolutions on different GPUs on our cluster, comparing two PyTorch builds:

  • the conda package from the pytorch channel (py3.8_cuda11.0.221_cudnn8.0.5_0)
  • PyTorch built on our cluster with EasyBuild (I will provide a link to the recipe and build options later)

For some configurations, the EasyBuild version performs significantly better than the conda package, especially on Volta and Ampere cards with half precision.
For details on the benchmarks, see the constantinpape/3d-unet-benchmarks repository on GitHub.
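
To give an idea of the kind of measurement involved, here is a minimal timing sketch. It is not the code from the benchmark repository: the function name `time_conv3d`, the two-layer model, and the tensor shape are assumptions chosen only for illustration. It times forward passes of a small 3D convolution stack in single precision, and in half precision when a GPU is available.

```python
import time

import torch
import torch.nn as nn


def time_conv3d(dtype, device, n_iter=10, shape=(1, 8, 32, 64, 64)):
    """Return the mean time in seconds of one forward pass.

    Hypothetical helper: the layer sizes and input shape are
    illustrative assumptions, not the network from the benchmarks.
    """
    model = nn.Sequential(
        nn.Conv3d(shape[1], 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv3d(32, 32, kernel_size=3, padding=1),
    ).to(device=device, dtype=dtype)
    x = torch.randn(shape, device=device, dtype=dtype)

    with torch.no_grad():
        # Warm-up so that cuDNN algorithm selection is not timed.
        for _ in range(3):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(n_iter):
            model(x)
        # CUDA kernels run asynchronously, so wait before stopping the clock.
        if device.type == "cuda":
            torch.cuda.synchronize()

    return (time.perf_counter() - start) / n_iter


if __name__ == "__main__":
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    # Half precision is only timed on the GPU.
    dtypes = [torch.float32, torch.float16] if use_cuda else [torch.float32]
    for dtype in dtypes:
        seconds = time_conv3d(dtype, device)
        print(f"{device.type} {dtype}: {seconds * 1e3:.2f} ms per forward pass")
```

Running the same script in both environments on the same card gives a rough per-build comparison.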

Is it possible that the conda package is built without the options needed to fully leverage features of the newer architectures, such as Tensor Cores?

Which CUDA and cuDNN versions are you using in your source build?
Besides the different versions, you might also be hitting this issue, which we are working on.
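
In case it helps with the comparison, the snippet below is a minimal sketch for printing what a given PyTorch build was compiled against. The helper name `describe_build` is made up for this example; the calls it wraps (`torch.version.cuda`, `torch.backends.cudnn.version()`, `torch.cuda.get_arch_list()`, `torch.__config__.show()`) are standard PyTorch APIs. Running it in both environments should make version and architecture differences visible.

```python
def describe_build():
    """Collect build information for the installed PyTorch.

    Hypothetical helper for illustration; returns {"torch": None}
    if PyTorch is not installed in the current environment.
    """
    try:
        import torch
    except ImportError:
        return {"torch": None}

    cuda_ok = torch.cuda.is_available()
    info = {
        "torch": torch.__version__,
        # None for CPU-only builds.
        "cuda": torch.version.cuda,
        # Integer version code, or None if cuDNN is unavailable.
        "cudnn": torch.backends.cudnn.version(),
        # GPU architectures the binary was compiled for.
        "arch_list": torch.cuda.get_arch_list() if cuda_ok else [],
        # Compile-time configuration as a human-readable string.
        "build_config": torch.__config__.show(),
    }
    if cuda_ok:
        info["device"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_build().items():
        print(f"{key}: {value}")
```

If the compute capability of the card is missing from the architecture list, the build is not running kernels compiled for that card.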

The source build uses CUDA 11.1.1 and cuDNN 8.0.4.
The issue you linked looks relevant; I will take a closer look.