Backwards pass runs ~50 times slower on 0.5.0 + GPU versus 0.4.0 + CPU

After much effort, I managed to build and install PyTorch on my MacBook so that I could use my GPU, using version 0.5.0a0+ba634c1. I also have an older 0.4.0 install with no GPU support in a different conda environment.

I started running my model with the new install and it felt significantly slower, both on the CPU and on the GPU. After timing it, I found that a single backwards pass took 28 seconds with 0.5.0, while the same code took 0.6 seconds with 0.4.0.

What could be slowing it down to such an extent? I’m not sure how to go about debugging the backwards pass…
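For context, this is roughly how I timed the backward pass in isolation (a minimal sketch with a hypothetical stand-in model; my actual model is not shown here):

```python
import time
import torch

# Hypothetical stand-in model; the real model is larger.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.Dropout(0.5),
    torch.nn.Linear(256, 10),
)
x = torch.randn(64, 128)
target = torch.randint(0, 10, (64,))

# Run the forward pass first so only backward() is inside the timer.
loss = torch.nn.functional.cross_entropy(model(x), target)

start = time.perf_counter()
loss.backward()
elapsed = time.perf_counter() - start
print(f"backward pass took {elapsed:.4f}s")
```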

GPU Build Info

PyTorch version: 0.5.0a0+ba634c1
Is debug build: No
CUDA used to build PyTorch: 9.2

OS: Mac OSX 10.13.6
GCC version: Could not collect
CMake version: version 3.9.1

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 9.2.148
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/local/cuda/lib/libcudnn.5.dylib
/usr/local/cuda/lib/libcudnn.6.dylib
/usr/local/cuda/lib/libcudnn.7.dylib
/usr/local/cuda/lib/libcudnn.dylib
/usr/local/cuda/lib/libcudnn_static.a
/usr/local/cuda8.0/lib/libcudnn.6.dylib
/usr/local/cuda8.0/lib/libcudnn_static.a

Versions of relevant libraries:
[pip3] numpy (1.14.4)
[pip3] torch (0.4.0)
[conda] torch 0.5.0a0+ba634c1

Note that my GPU is a GeForce 750M, and the output of nvcc --version is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:08:12_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148

Non-GPU Build Info (the older, faster one)

PyTorch version: 0.4.0
Is debug build: No
CUDA used to build PyTorch: Could not collect

OS: Mac OSX 10.13.6
GCC version: Could not collect
CMake version: version 3.12.0

Python version: 3.6
Is CUDA available: No
CUDA runtime version: 9.2.148
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/local/cuda/lib/libcudnn.5.dylib
/usr/local/cuda/lib/libcudnn.6.dylib
/usr/local/cuda/lib/libcudnn.7.dylib
/usr/local/cuda/lib/libcudnn.dylib
/usr/local/cuda/lib/libcudnn_static.a
/usr/local/cuda8.0/lib/libcudnn.6.dylib
/usr/local/cuda8.0/lib/libcudnn_static.a

Versions of relevant libraries:
[pip3] numpy (1.14.4)
[pip3] torch (0.4.0)
[conda] torch 0.4.0
[conda] torchvision 0.1.9 py36_1 soumith

Having compared the Autograd profiler results between the two installs, the major bottleneck seems to be one really long “Dropout” call (there are two in total; one takes 65 us and the other 24,228 us) and one really long “bernoulli_” call (again two; one takes 15 us and the other 24,049 us).
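For reference, this is roughly how I generated the profiler comparison (a minimal sketch; the model here is a hypothetical stand-in containing a Dropout layer, which triggers the bernoulli_ call during training):

```python
import torch

# Hypothetical tiny model with a Dropout layer, mirroring the slow ops.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Dropout(0.5))
x = torch.randn(32, 64)

# Profile one forward/backward pass; per-operator CPU (and CUDA) times
# appear in the table, so slow ops like dropout/bernoulli_ stand out.
with torch.autograd.profiler.profile() as prof:
    out = model(x).sum()
    out.backward()

table = prof.key_averages().table(sort_by="cpu_time_total")
print(table)
```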

Also, what becomes apparent when I run the profiler is that even if I call .cuda() on my model, all the time is spent on the CPU and the CUDA column shows 0.00 for everything…

I do get the warning

Found GPU0 GeForce GT 750M which is of cuda capability 3.0.
    PyTorch no longer supports this GPU because it is too old.

However, everything runs and tensors show device='cuda:0'.

It could be that your GPU really isn’t supported anymore. In any case, I would also try updating the NVIDIA drivers. Could you check GPU usage via e.g. nvidia-smi in the terminal?

Your problem may also be related to this thread here: GPU utilized 99% but Cudnn not used, extremely slow

In addition to torch.cuda.is_available() returning True, have you checked that torch.backends.cudnn.version() returns something other than None?
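These checks can be run directly from a Python session (a minimal sketch):

```python
import torch

# Both checks should pass for cuDNN-accelerated code paths to be used:
# CUDA must be available, and cuDNN must have been found at build time.
print("CUDA available:", torch.cuda.is_available())
print("cuDNN version:", torch.backends.cudnn.version())
```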

torch.backends.cudnn.version() returns 7104.

Unfortunately there is no decent command-line tool analogous to nvidia-smi on OSX. I’ve used Activity Monitor, and it does show what appears to be consistently full GPU usage, but it’s only a graph with no other information, so it’s of limited merit. The GPU does appear to be in use, though!

My NVIDIA drivers are also up to date. Maybe it’s just an issue with my GPU not being supported. I will try installing it with NO_CUDA=1 and see if it does the same.

Edit: I installed it without CUDA and it’s now even slower: it takes 93 seconds versus the 0.6 seconds taken by 0.4.0.

I saved the logs from the install: logs here and warnings here

Hm, it does sound like it’s using your GPU somehow. Maybe it isn’t using the GPU-accelerated CUDA/cuDNN code paths for all operations, which would explain why it’s now slower than before.

I can deal with not using my GPU but I would like to use the master, do you have any idea what might be causing the speed differences between CPU 0.4.0 and CPU 0.5.0? Or how I can go about diagnosing a potential cause?

Sorry, I have no idea why it would be that slow on CPU now (compared to the CPU performance you got on 0.4).

The problem you have, @al3x, is that when you compiled PyTorch from master, it didn’t detect a good BLAS library (such as MKL or Accelerate) and is using unoptimized code.
When you installed the 0.4.0 binary, it was correctly linked against MKL.

That explains the 28s versus 0.6 seconds.

In our build-from-source instructions, the critical part that ensures you link against MKL is this section:

export CMAKE_PREFIX_PATH=[anaconda root directory]
conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing