.cuda() is very slow!

Hi
I have installed CUDA 9.1.85 and the PyTorch v0.3.1 pre-built binary:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
$ inxi -G
Graphics:  Card-1: Intel HD Graphics 520 driver: i915 v: kernel 
           Card-2: NVIDIA GM108M [GeForce 920MX] driver: N/A 
           Display Server: x11 (X.Org 1.19.6) driver: none unloaded: intel resolution: 1920x1080~60Hz 
           OpenGL: renderer: Mesa DRI Intel HD Graphics 520 (Skylake GT2) version: 4.5 Mesa 17.3.7 
           direct render: Yes 

And this is a test I’ve run:

# test_cuda.py
import torch
from datetime import datetime

for i in range(10):
    x = torch.randn(10, 10, 10, 10)
    t1 = datetime.now()
    x.cuda()
    print(i, datetime.now() - t1)

For the pre-built PyTorch the result is fast enough, but for a more complicated example (which uses something like my_model.cuda()) I get the "no kernel image" error:
RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device.
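
(For reference, the compute capability the driver reports for the device can be checked from Python; I’m assuming torch.cuda.get_device_capability is available in this version:)

# check what compute capability the driver reports for GPU 0
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print('GPU 0 compute capability: %d.%d' % (major, minor))
else:
    print('CUDA is not available')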

So I’ve compiled PyTorch from source:

$ git clone https://github.com/pytorch/pytorch.git
$ cd pytorch
$ git checkout v0.3.1
$ export CC=gcc-6
$ export CXX=g++-6
$ python setup.py install

After that I’ve run the simple test again, and now I get a very slow result on the GPU :frowning_face::

Found GPU0 GeForce 920MX which is of cuda capability 5.0.
    PyTorch no longer supports this GPU because it is too old.
    
  warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
0 0:00:02.086585
1 0:00:00.000071
2 0:00:00.000053
3 0:00:00.000051
4 0:00:00.000051
5 0:00:00.000052
6 0:00:00.000065
7 0:00:00.000052
8 0:00:00.000051
9 0:00:00.000052

So what should I do next? Probably changing my laptop?!

What’s the expected time for this? Note that the first CUDA operation is always slower because it has to initialize the CUDA context.

something like this:

0 0:00:00.000001
1 0:00:00
2 0:00:00.000001
3 0:00:00
4 0:00:00.000001
5 0:00:00.000005
6 0:00:00.000004
7 0:00:00.000008
8 0:00:00.000007
9 0:00:00.000006
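
If you want to exclude that one-off context-initialization cost from the loop, a rough sketch (just an illustration, using the same test):

# warm-up sketch: pay the CUDA context-initialization cost once,
# outside the timed loop
import torch
from datetime import datetime

torch.randn(1).cuda()  # throw-away transfer; initializes the CUDA context

for i in range(10):
    x = torch.randn(10, 10, 10, 10)
    t1 = datetime.now()
    x.cuda()
    print(i, datetime.now() - t1)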

Regardless of this simple test, for a real example it takes much longer than on the CPU.

Notice the warning: it’s not using the GPU because it’s not compatible.

The 920MX is not supported by the official packages since 0.3.1. You need to compile PyTorch yourself to get CUDA support.
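
If it helps, the build can usually be restricted to the 920MX’s compute capability via the TORCH_CUDA_ARCH_LIST environment variable. I’m not certain the 0.3.1 build scripts honor it, so treat this as a sketch rather than a verified recipe:

$ cd pytorch
$ git checkout v0.3.1
$ git submodule update --init --recursive
$ export CC=gcc-6
$ export CXX=g++-6
$ export TORCH_CUDA_ARCH_LIST="5.0"   # compute capability 5.0 = GeForce 920MX
$ python setup.py install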

I did compile PyTorch from source, and that was the result after a successful compilation.

I didn’t look into the .cuda() implementation, but I guess there must be a cudaMemcpy somewhere.
A memory copy between the CPU and GPU simply takes time (it happens at the hardware level) and is not likely to be sped up by a few lines of code.
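
On top of that, CUDA calls can be asynchronous, so if you want to time a single transfer cleanly it’s safer to synchronize before reading the clock. A minimal sketch:

# timing sketch: synchronize so the measurement covers the whole copy,
# not just the time to enqueue it
import torch
from datetime import datetime

x = torch.randn(10, 10, 10, 10)
x.cuda()                     # warm-up: context initialization
torch.cuda.synchronize()

t1 = datetime.now()
y = x.cuda()                 # host-to-device copy
torch.cuda.synchronize()     # wait for the copy to actually finish
print(datetime.now() - t1)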

Sorry, but may I ask where you got these numbers from? I’m not questioning your theory. I just personally found the original numbers reasonable, and am wondering if you saw better numbers using a previous version or another framework.

This is the test code for CUDA:

# test_cuda.py
import torch
from datetime import datetime

for i in range(10):
    x = torch.randn(10, 10, 10, 10)
    t1 = datetime.now()
    x.cuda()
    print(i, datetime.now() - t1)

and the result for my self-compiled PyTorch on CUDA 9.1.85 is:

Found GPU0 GeForce 920MX which is of cuda capability 5.0.
    PyTorch no longer supports this GPU because it is too old.
    
  warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
0 0:00:02.086585
1 0:00:00.000071
2 0:00:00.000053
3 0:00:00.000051
4 0:00:00.000051
5 0:00:00.000052
6 0:00:00.000065
7 0:00:00.000052
8 0:00:00.000051
9 0:00:00.000052

but the same code for the CPU:

for i in range(10):
    x = torch.randn(10, 10, 10, 10)
    t1 = datetime.now()
    #x.cuda()
    print(i, datetime.now() - t1)

gives me a faster result:

0 0:00:00.000001
1 0:00:00.000001
2 0:00:00.000001
3 0:00:00.000001
4 0:00:00.000001
5 0:00:00.000003
6 0:00:00.000004
7 0:00:00.000004
8 0:00:00.000005
9 0:00:00.000004

But as I said, I’m not relying on this test code; I’ve actually run real code for CNNs and GANs, which runs faster on the CPU on my system.

…why do you even expect “sampling on the CPU, and then copying to the GPU” to be faster than just “sampling on the CPU”?

These are fractions of a second (roughly 50 microseconds per copy). Given that you have to do this only once per iteration (or twice if you also have a target array), it is super negligible. Or in other words, after about 20,000 iterations (which probably take minutes to hours depending on your architecture) you lose about one second.
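
A quick back-of-the-envelope check with the numbers above (treating ~50 µs as the per-copy overhead; this is just the arithmetic, not a measurement):

# how many iterations until ~50 us of copy overhead adds up to one second
per_copy_overhead = 51e-6                    # seconds, from the timings above
iterations_to_lose_one_second = 1.0 / per_copy_overhead
print(round(iterations_to_lose_one_second))  # ~20000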