PyTorch from pip with CUDA 9 faster than from conda with CUDA 9 and 10

I tried PyTorch from conda built with CUDA 9 and 10, from pip built with CUDA 9, and PyTorch built from source inside a conda environment, all on the MNIST training example: https://github.com/pytorch/examples/tree/master/mnist
The training time for the pip version with CUDA 9 was around 71 seconds; for the conda version with CUDA 9 it was around 90 seconds, and for the conda version with CUDA 10 and the one built from source it was around 118 seconds. Will PyTorch release 1.2.0 built with CUDA 10 on pip? Or is there a way to build PyTorch from source without conda?

The environment:
GPU: 1080Ti
Driver: 430.4
Python version: 3.6.8
torch.backends.cudnn.enabled is True.
Conda is installed with Miniconda.
The pip version of PyTorch is installed in a virtual environment created by virtualenv.
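For reference, a quick way to check which CUDA and cuDNN builds a given install is actually using (a minimal sketch; the printed values below are just the ones reported later in this thread):

```python
import torch

print(torch.__version__)               # PyTorch build, e.g. '1.2.0'
print(torch.version.cuda)              # bundled CUDA version, e.g. '10.0.130'
print(torch.backends.cudnn.version())  # cuDNN version, e.g. 7602
print(torch.backends.cudnn.enabled)    # whether cuDNN is enabled
```

Running this in both the pip virtualenv and the conda environment makes it easy to compare the two installs.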

Hi,

You can build and install PyTorch from source without conda (I have actually never done it with conda).
For best performance, make sure you have the latest CUDA and cuDNN installed, and OpenBLAS if you're going to do some CPU work.

Thinking about it, the difference could come from different cuDNN versions bundled with the pip and conda installs.

Hi,
Thank you for the answer. Could you tell me how you installed magma and got setup.py to find it? I tried copying the contents of the magma-cuda100 package from Anaconda Cloud into the corresponding lib and include folders of my Python installation, but to no avail.

The cuDNN and CUDA versions in my pip and conda installs are the same: cuDNN 7602 and CUDA 10.0.130.

For magma, I installed it from source at the default location; it is then found automatically.

With the same cudnn versions, I’m not sure why you see that difference though…

Since your timing differences are in the range of values I just measured while verifying a fix for https://github.com/pytorch/pytorch/issues/25010, you might want to try setting pin_memory=False in main.py and see if that improves things for your slow cases.
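Concretely, the change amounts to passing pin_memory=False in the DataLoader kwargs. A minimal self-contained sketch (the random TensorDataset here is just an illustrative stand-in for the example's MNIST dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in for the MNIST dataset used in the example
dataset = TensorDataset(torch.randn(256, 1, 28, 28),
                        torch.randint(0, 10, (256,)))

# pin_memory=False sidesteps the pinned-memory slowdown discussed in issue 25010
train_loader = DataLoader(dataset, batch_size=64, shuffle=True,
                          num_workers=0, pin_memory=False)

for images, labels in train_loader:
    pass  # training step would go here
```

In the example's main.py the equivalent change is made where the loader kwargs are defined.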

My timings running the MNIST training for the default 10 epochs. +/- 1 second is noise; there is clearly a big issue in 1.2 (and I see it in 1.1 as well), but not in 1.0. Interestingly, there is also a difference between 1.0 and 1.2, as 57 vs. 63-64 seconds is outside the measurement noise. I don't have numbers across different CUDA versions or pip vs. conda, though.

PyTorch 1.2, Conda, Python 3.7.3, CUDA 10.0.130, cuDNN 7.6.2

pin_memory=True, num_workers=1, no fix applied: 131.03s
pin_memory=True, num_workers=1, fix applied: 63.5s
pin_memory=False, num_workers=1: 63.25s
pin_memory=True, num_workers=0, no fix applied: 66.70s
pin_memory=True, num_workers=0, fix applied: 64.42s
pin_memory=False, num_workers=0: 64.03s

PyTorch 1.0, Conda, Python 3.6.8, CUDA 10.0.130, cuDNN 7.4.1

pin_memory=True, num_workers=1: 57.83s
pin_memory=False, num_workers=1: 57.48s
pin_memory=True, num_workers=0: 57.32s
pin_memory=False, num_workers=0: 56.95s
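Timings like the ones above can be collected with a simple wall-clock wrapper around the training loop; a minimal sketch (the train_one_epoch callable is a placeholder for the example's per-epoch training function):

```python
import time

def time_epochs(train_one_epoch, epochs=10):
    """Measure total wall-clock time for the given number of epochs."""
    start = time.perf_counter()
    for epoch in range(1, epochs + 1):
        train_one_epoch(epoch)
    return time.perf_counter() - start

# Usage sketch:
# elapsed = time_epochs(lambda epoch: train(model, device, train_loader, optimizer, epoch))
# print(f"{elapsed:.2f}s")
```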

Thank you very much! This really is the answer to the difference! After setting pin_memory=False, the performance of PyTorch 1.2 with CUDA 9 and with CUDA 10 was the same. I didn't test 1.0, and it's really interesting to see that 1.0 is even faster.