.cuda() is very slow!

Hi
I have installed CUDA 9.1.85 and the PyTorch v0.3.1 pre-built binary:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
$ inxi -G
Graphics:  Card-1: Intel HD Graphics 520 driver: i915 v: kernel 
           Card-2: NVIDIA GM108M [GeForce 920MX] driver: N/A 
           Display Server: x11 (X.Org 1.19.6) driver: none unloaded: intel resolution: 1920x1080~60Hz 
           OpenGL: renderer: Mesa DRI Intel HD Graphics 520 (Skylake GT2) version: 4.5 Mesa 17.3.7 
           direct render: Yes 

And this is a test I’ve run:

# test_cuda.py
import torch
from datetime import datetime

for i in range(10):
    x = torch.randn(10, 10, 10, 10)
    t1 = datetime.now()
    x.cuda()
    print(i, datetime.now() - t1)

For the pre-built PyTorch the result is fast enough, but for a more complicated example (which uses something like my_model.cuda()) I get the "no kernel image" error:
RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device.
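
(For reference, the compute capability the driver reports for the device can be checked from Python; I’m assuming torch.cuda.get_device_capability is available in this version:)

# check what compute capability the driver reports for GPU 0
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print('GPU 0 compute capability: %d.%d' % (major, minor))
else:
    print('CUDA is not available')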

So I’ve compiled PyTorch from source:

$ git clone https://github.com/pytorch/pytorch.git
$ cd pytorch
$ git checkout v0.3.1
$ export CC=gcc-6
$ export CXX=g++-6
$ python setup.py install

After that I’ve run the simple test again, and now I get a very slow result on the GPU :frowning_face::

Found GPU0 GeForce 920MX which is of cuda capability 5.0.
    PyTorch no longer supports this GPU because it is too old.
    
  warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
0 0:00:02.086585
1 0:00:00.000071
2 0:00:00.000053
3 0:00:00.000051
4 0:00:00.000051
5 0:00:00.000052
6 0:00:00.000065
7 0:00:00.000052
8 0:00:00.000051
9 0:00:00.000052

So what should I do next? Probably changing my laptop?!

What’s the expected time for this? Note that the first CUDA operation is always slower because it has to initialize the CUDA context.

something like this:

0 0:00:00.000001
1 0:00:00
2 0:00:00.000001
3 0:00:00
4 0:00:00.000001
5 0:00:00.000005
6 0:00:00.000004
7 0:00:00.000008
8 0:00:00.000007
9 0:00:00.000006
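
If you want to exclude that one-off context-initialization cost from the loop, a rough sketch (just an illustration, using the same test):

# warm-up sketch: pay the CUDA context-initialization cost once,
# outside the timed loop
import torch
from datetime import datetime

torch.randn(1).cuda()  # throw-away transfer; initializes the CUDA context

for i in range(10):
    x = torch.randn(10, 10, 10, 10)
    t1 = datetime.now()
    x.cuda()
    print(i, datetime.now() - t1)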

Regardless of this simple test, for a real example it takes much longer than on the CPU.

Notice the warning: it’s not using the GPU because it’s not compatible.

The 920MX is not supported by the official packages since 0.3.1. You need to compile PyTorch yourself to get CUDA support.
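
If it helps, the build can usually be restricted to the 920MX’s compute capability via the TORCH_CUDA_ARCH_LIST environment variable. I’m not certain the 0.3.1 build scripts honor it, so treat this as a sketch rather than a verified recipe:

$ cd pytorch
$ git checkout v0.3.1
$ git submodule update --init --recursive
$ export CC=gcc-6
$ export CXX=g++-6
$ export TORCH_CUDA_ARCH_LIST="5.0"   # compute capability 5.0 = GeForce 920MX
$ python setup.py install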

I did compile PyTorch from source, and that was the result after a successful compilation.

I didn’t look into the .cuda() implementation, but I guess there must be a cudaMemcpy somewhere.
A memory copy between the CPU and GPU simply takes time (it happens at the hardware level) and is not likely to be sped up by a few lines of code.
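
On top of that, CUDA calls can be asynchronous, so if you want to time a single transfer cleanly it’s safer to synchronize before reading the clock. A minimal sketch:

# timing sketch: synchronize so the measurement covers the whole copy,
# not just the time to enqueue it
import torch
from datetime import datetime

x = torch.randn(10, 10, 10, 10)
x.cuda()                     # warm-up: context initialization
torch.cuda.synchronize()

t1 = datetime.now()
y = x.cuda()                 # host-to-device copy
torch.cuda.synchronize()     # wait for the copy to actually finish
print(datetime.now() - t1)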

Sorry, but may I ask where you got these numbers from? I’m not questioning your theory. I just personally found the original numbers reasonable, and am wondering if you saw better numbers using a previous version or another framework.

This is the test code for CUDA:

# test_cuda.py
import torch
from datetime import datetime

for i in range(10):
    x = torch.randn(10, 10, 10, 10)
    t1 = datetime.now()
    x.cuda()
    print(i, datetime.now() - t1)

and the result for my self-compiled PyTorch on CUDA 9.1.85 is:

Found GPU0 GeForce 920MX which is of cuda capability 5.0.
    PyTorch no longer supports this GPU because it is too old.
    
  warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
0 0:00:02.086585
1 0:00:00.000071
2 0:00:00.000053
3 0:00:00.000051
4 0:00:00.000051
5 0:00:00.000052
6 0:00:00.000065
7 0:00:00.000052
8 0:00:00.000051
9 0:00:00.000052

but the same code for the CPU:

for i in range(10):
    x = torch.randn(10, 10, 10, 10)
    t1 = datetime.now()
    #x.cuda()
    print(i, datetime.now() - t1)

gives me a faster result:

0 0:00:00.000001
1 0:00:00.000001
2 0:00:00.000001
3 0:00:00.000001
4 0:00:00.000001
5 0:00:00.000003
6 0:00:00.000004
7 0:00:00.000004
8 0:00:00.000005
9 0:00:00.000004

But as I said, I’m not relying on this test code; I’ve actually run real code for CNNs and GANs, which runs faster on the CPU on my system.

…why do you even expect “sampling on the CPU, and then copying to the GPU” to be faster than just “sampling on the CPU”?

These are fractions of a second (roughly 50 microseconds per copy). Given that you have to do this only once per iteration (or twice if you also have a target array), it is super negligible. Or in other words, after about 20,000 iterations (which probably take minutes to hours depending on your architecture) you lose about one second.
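
A quick back-of-the-envelope check with the numbers above (treating ~50 µs as the per-copy overhead; this is just the arithmetic, not a measurement):

# how many iterations until ~50 us of copy overhead adds up to one second
per_copy_overhead = 51e-6                    # seconds, from the timings above
iterations_to_lose_one_second = 1.0 / per_copy_overhead
print(round(iterations_to_lose_one_second))  # ~20000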