I have benchmarked the performance of PyTorch for a network using 3D convolutions with different GPUs on our cluster, comparing two different PyTorch builds:
- the conda package from the pytorch channel (`py3.8_cuda11.0.221_cudnn8.0.5_0`)
- a PyTorch build on our cluster made with EasyBuild (I will provide a link to the recipe and build options later)
For some configurations the EasyBuild version performs significantly better than the conda package, especially on Volta and Ampere cards with half precision.
For details on the benchmarks, see GitHub - constantinpape/3d-unet-benchmarks.
Is it possible that the conda package is built without the correct options to fully leverage features of the newer architectures, such as Tensor Cores?
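One way to probe this would be to compare which CUDA architectures each build was compiled for, e.g. via `torch.cuda.get_arch_list()` and `torch.__config__.show()` in each environment. As a minimal sketch (the arch lists below are hypothetical, not taken from either build), a build supports a GPU natively if it ships SASS for that compute capability (`sm_XX`), or only via driver JIT if it merely ships older PTX (`compute_XX`):

```python
# Sketch: classify how a PyTorch build's compiled CUDA architectures
# cover a given GPU. Arch strings mimic torch.cuda.get_arch_list();
# the example lists are made up for illustration.

def arch_support(arch_list, major, minor):
    """Return "native" if the build ships SASS for this compute
    capability, "jit" if only older PTX is available (the driver
    must JIT-compile it at load time), or None if unsupported."""
    cc = major * 10 + minor
    support = None
    for arch in arch_list:
        kind, _, num = arch.partition("_")
        num = int(num)
        if kind == "sm" and num == cc:
            return "native"          # binary compiled for this GPU
        if kind == "compute" and num <= cc:
            support = "jit"          # forward-compatible PTX only
    return support

# Hypothetical arch lists:
conda_build = ["sm_37", "sm_50", "sm_60", "sm_70", "compute_70"]
local_build = ["sm_70", "sm_80", "compute_80"]

# Volta (V100) is compute capability 7.0; Ampere (A100) is 8.0.
print(arch_support(conda_build, 7, 0))  # native
print(arch_support(conda_build, 8, 0))  # jit
print(arch_support(local_build, 8, 0))  # native
```

If the conda build only reaches a newer card through JIT-compiled PTX for an older architecture, kernels tuned for that architecture (including Tensor Core paths) may not be selected, which could explain part of the gap.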