To work with a remote GPU server running CentOS 6.9, I had to install PyTorch from source (the packaged version has an incompatibility with glibc). I ran 20 batches of my training process under the autograd profiler and inspected the trace with chrome://tracing. My local machine, which has a GeForce 1080 Ti, processes a batch 10x faster than the remote server (which uses a single Tesla K80 GPU). The GPUs don't have the same specs, but I would expect only about a 2x slowdown on the Tesla.
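For reference, the profiling loop looks roughly like this (a minimal sketch; the model, loss, and data below are stand-ins for my actual conv net and DataLoader):

```python
import torch
import torch.nn as nn
from torch.autograd import profiler

# Minimal stand-ins for the real model and data (the actual model is a conv net).
model = nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Profile ~20 training batches and export a trace viewable in chrome://tracing.
with profiler.profile(use_cuda=True) as prof:
    for _ in range(20):
        inputs = torch.randn(8, 3, 224, 224, device="cuda")
        targets = torch.randn(8, 16, 224, 224, device="cuda")
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()   # this is where the slow backward conv op shows up
        optimizer.step()

prof.export_chrome_trace("trace.json")  # load this file in chrome://tracing
```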
I singled out the backward convolution operation, which is extremely slow on the server. Oddly, the trace does not show the same function name on the two machines. Here is the comparison:
| GPU | Backward conv op | Time |
|---|---|---|
| Tesla K80 (server) | `ThnnConv2DBackward` | 62 ms |
| GeForce 1080 Ti (local) | `CudnnConvolutionBackward` | 0.0000037 ms |
From the function names, one could deduce that cuDNN is not being used on the server (`ThnnConv2DBackward` is the fallback implementation, while `CudnnConvolutionBackward` goes through cuDNN). Yet nvidia-smi shows 75-99% GPU utilization during training, so CUDA itself seems to be working. Do I have a problem with my PyTorch installation?
If cuDNN isn't detected at build time, you could try setting CUDNN_LIB_DIR (the directory containing libcudnn.so or similar) and CUDNN_INCLUDE_DIR (the directory containing cudnn.h) before rebuilding from source.
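Before rebuilding, a quick sanity check on the server can confirm whether the current build picked up cuDNN at all:

```python
import torch

print(torch.cuda.is_available())        # True => CUDA works (consistent with nvidia-smi)
print(torch.backends.cudnn.enabled)     # the cuDNN backend is allowed to run
print(torch.backends.cudnn.version())   # an integer if cuDNN was found at build time,
                                        # None if the build fell back to THNN/THCUNN
```

If `version()` returns None, the build never found cuDNN, which would explain the `ThnnConv2DBackward` nodes in the trace despite the high GPU utilization.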