GPU utilized at 99% but cuDNN not used, extremely slow

Hello!

To work with a remote GPU server running CentOS 6.9, I had to install PyTorch from source (the packaged version has an incompatibility with glibc). I ran 20 batches of my training process with the autograd profiler and looked at the trace with chrome://tracing. My local computer, which has a GeForce 1080 Ti, processes a batch 10x faster than the remote server (which uses one Tesla K80 GPU). The GPUs don’t have the same specs, but I would expect only about a 2x slowdown on the Tesla.

I singled out the backward convolution operation, which is extremely slow on the server. Strangely, the trace does not show the same function name on the two machines. Here is the comparison:

GPU             | Backward op              | Time
Tesla K80       | ThnnConv2DBackward       | 62 ms
GeForce 1080 Ti | CudnnConvolutionBackward | 0.0000037 ms

From the function name, one could deduce that cuDNN is not used on the server. The GPU itself is clearly active, though: nvidia-smi shows 75-99% utilization during the process. Is there a problem with my PyTorch installation?

Thanks for your time!

Did you see the cuDNN version during the build on your server?
Could you check the cuDNN version on the server (if available) with:

torch.backends.cudnn.version()
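
If it's easier to run the check from a shell on the server, the same calls work as a one-liner (just a sketch):

    # Prints e.g. "True None" if CUDA works but PyTorch was built without cuDNN
    python -c "import torch; print(torch.cuda.is_available(), torch.backends.cudnn.version())"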

Hello ptrblck_de!

torch.backends.cudnn.version() returns None, while torch.cuda.is_available() returns True!

What does this mean?

It means cuDNN wasn’t found during the build process.
The build logs should have shown this as well.
Could you try to install cuDNN and re-build PyTorch?

I should say this is a shared server, provided by a national organization. cuDNN is in fact already installed on this server, but:

  1. It has to be loaded with module load cuda/8.0.44 libs/cuDNN/6
  2. The CUDA installation is located in /software-gpu/cuda/8.0.44, and cuDNN is in /software-gpu/libs/cuDNN/6_cuda8.0

Should I do something special to build PyTorch correctly? Maybe set $CUDA_HOME?

I don’t have the build logs anymore; I built PyTorch about a month ago.

If cuDNN isn’t detected, you could try to specify CUDNN_LIB_DIR (libcudnn.so or something similar should be in the dir) and CUDNN_INCLUDE_DIR (cudnn.h should be in the dir).
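
With the paths mentioned above, that could look roughly like the following before starting the build (a sketch only; the exact lib/include subdirectories under the cuDNN module are an assumption about its layout):

    # Assumed layout: adjust to wherever libcudnn.so* and cudnn.h actually live
    export CUDNN_LIB_DIR=/software-gpu/libs/cuDNN/6_cuda8.0/lib64
    export CUDNN_INCLUDE_DIR=/software-gpu/libs/cuDNN/6_cuda8.0/include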

Thanks a lot ptrblck, I will try this. Congrats on being a great person.

EDIT:
Here is what I’ve done to compile PyTorch (a sketch of the commands follows the list):

  1. Load python, cuda and cudnn modules
  2. Activate the conda environment (torchvision should already be installed)
  3. Set CUDA_HOME, CUDNN_LIB_DIR and CUDNN_INCLUDE_DIR
  4. python setup.py clean && python setup.py install
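
Put together, the rebuild looked roughly like this (a sketch, assuming the module names and paths from earlier in the thread; the conda environment name is made up, and the lib64/include subdirectories are guesses about the module layout):

    # 1. Load the CUDA and cuDNN modules (the python module is loaded similarly on this server)
    module load cuda/8.0.44 libs/cuDNN/6

    # 2. Activate the conda environment (torchvision already installed there)
    source activate my-pytorch-env   # hypothetical environment name

    # 3. Point the build at the CUDA and cuDNN installations
    export CUDA_HOME=/software-gpu/cuda/8.0.44
    export CUDNN_LIB_DIR=/software-gpu/libs/cuDNN/6_cuda8.0/lib64        # assumed subdir
    export CUDNN_INCLUDE_DIR=/software-gpu/libs/cuDNN/6_cuda8.0/include  # assumed subdir

    # 4. Clean and rebuild
    python setup.py clean && python setup.py install

    # Verify that cuDNN was picked up this time
    python -c "import torch; print(torch.backends.cudnn.version())"

After rebuilding, torch.backends.cudnn.version() should return a version number instead of None, and the backward convolution should show up as CudnnConvolutionBackward in the profiler trace.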