To work with a remote GPU server running CentOS 6.9, I had to install PyTorch from source (the packaged version has an incompatibility with glibc). I ran 20 batches of my training process with the autograd profiler and looked at the trace with
chrome://tracing. My local computer, which has a GeForce 1080 Ti, processes a batch 10x faster than the remote server does (which uses one Tesla K80 GPU). The GPUs don’t have the same specs, but I would expect only a 2x slowdown for the Tesla.
I singled out the backward conv operation, which is extremely slow on the server. Weirdly, the trace does not show the same function name on the two machines. Here is the comparison:
Tesla K80 | ThnnConv2DBackward | 62 ms
GeForce 1080 Ti | CudnnConvolutionBackward | 0.0000037 ms
From the function name, one could deduce that CUDA is not used on the server. But nvidia-smi shows 75-99% utilization during the process. Is there a problem with my PyTorch installation?
Thanks for your time!
Did you see the cuDNN version during the build on your server?
Could you check the cuDNN version on the server (if available)?
What does this mean?
It means during the build process cuDNN wasn’t found.
The logs should have shown it as well.
Could you try to install cuDNN and re-build PyTorch?
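For anyone landing here later, a minimal sketch of how one could check from Python whether a given build sees cuDNN. The helper name `cudnn_build_info` is made up for this example; `torch.backends.cudnn.version()` is expected to return `None` when PyTorch was built without cuDNN, which would match the `ThnnConv2DBackward` entries in the trace.

```python
def cudnn_build_info():
    """Report what this PyTorch build knows about CUDA and cuDNN.

    Hypothetical helper for illustration; returns {"torch": None}
    if PyTorch is not importable in the current environment.
    """
    try:
        import torch
    except ImportError:
        return {"torch": None}
    return {
        "torch": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        # None here suggests the build did not find cuDNN.
        "cudnn_version": torch.backends.cudnn.version(),
        "cudnn_enabled": torch.backends.cudnn.enabled,
    }


print(cudnn_build_info())
```

Running this inside the same conda environment (with the same modules loaded) as the training job is what matters, since a different environment may pick up a different build.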
I should say this is a shared server, provided by a national organization. cuDNN is in fact already installed on this server, but:
- It has to be loaded with
module load cuda/8.0.44 libs/cuDNN/6
- The cuda installation is located in
/software-gpu/cuda/8.0.44, and cuDNN is in
Should I do something special to build pytorch correctly? Maybe change
I don’t have the build logs; I built PyTorch about a month ago.
If cuDNN isn’t detected, you could try to specify
CUDNN_LIB_DIR (libcudnn.so or something similar should be in the dir) and
CUDNN_INCLUDE_DIR (cudnn.h should be in the dir).
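Before rebuilding, it may be worth confirming that the two directories really contain what the build looks for. A small sketch, assuming nothing beyond the standard library; `check_cudnn_dirs` and the example paths are hypothetical and need to be replaced with the actual locations on the server.

```python
import glob
import os


def check_cudnn_dirs(lib_dir, include_dir):
    """Look for libcudnn* in lib_dir and cudnn.h in include_dir."""
    libs = sorted(glob.glob(os.path.join(lib_dir, "libcudnn*")))
    header = os.path.join(include_dir, "cudnn.h")
    return {"libs": libs, "header_found": os.path.isfile(header)}


# Hypothetical paths, for illustration only.
report = check_cudnn_dirs("/path/to/cudnn/lib64", "/path/to/cudnn/include")
print(report)
```

If `libs` comes back empty or `header_found` is `False`, the directories passed to the build are probably not the right ones.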
Thanks a lot ptrblck, I will try this. Congrats on being a great person.
Here is what I’ve done to compile PyTorch:
- Load python, cuda and cudnn modules
- Activate conda environment (torchvision should be already installed)
- Set CUDA_HOME, CUDNN_LIB_DIR and CUDNN_INCLUDE_DIR
- python setup.py clean && python setup.py install
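The last two steps above could also be scripted, roughly as follows. This is only a sketch: `build_env` and `rebuild` are made-up helper names, the cuDNN paths in the usage comment are placeholders, and the python, cuda and cudnn modules still have to be loaded in the shell (and the conda environment activated) before running it, since `module load` changes the shell environment.

```python
import os
import subprocess
import sys


def build_env(cuda_home, cudnn_lib_dir, cudnn_include_dir, base=None):
    """Copy the environment and add the three variables the build reads."""
    env = dict(os.environ if base is None else base)
    env.update(
        CUDA_HOME=cuda_home,
        CUDNN_LIB_DIR=cudnn_lib_dir,
        CUDNN_INCLUDE_DIR=cudnn_include_dir,
    )
    return env


def rebuild(source_dir, env):
    """Run `python setup.py clean` then `python setup.py install`."""
    for step in ("clean", "install"):
        subprocess.run(
            [sys.executable, "setup.py", step],
            cwd=source_dir,
            env=env,
            check=True,  # stop at the first failing step
        )


# Usage (placeholder cuDNN paths, not the real ones on the server):
# env = build_env("/software-gpu/cuda/8.0.44",
#                 "/path/to/cudnn/lib64",
#                 "/path/to/cudnn/include")
# rebuild("/path/to/pytorch", env)
```

Keeping the build output this time (for example by redirecting it to a file) makes it possible to check whether cuDNN was found.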