I wanted to write an RNN from scratch using PyTorch's CUDA capabilities, and I ran some preliminary tests to compare CPU vs. GPU speed. The task is very simple: a for loop mimicking the update of the internal state x of an RNN with recurrent weight matrix J. I'm using a Quadro K620 with CUDA 8.0.
With a state size of N = 1000 there seems to be a trade-off: the GPU implementation consistently gets slower as the number of iterations increases (I ran some other tests with different sizes of the J matrix and this behaviour seems pretty systematic).
These are typical running times I get from the enclosed script:
cpu: [0.010117292404174805, 0.058980703353881836, 0.45785975456237793, 4.512230634689331]
gpu: [0.0019445419311523438, 0.05474495887756348, 0.7503962516784668, 7.011191129684448]
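To make the trade-off easier to see, here are the same numbers divided by the iteration count (my own arithmetic on the output above, nothing new measured):

```python
# Timings from the run above (seconds for the whole loop)
cpu = [0.010117292404174805, 0.058980703353881836, 0.45785975456237793, 4.512230634689331]
gpu = [0.0019445419311523438, 0.05474495887756348, 0.7503962516784668, 7.011191129684448]
iters = [100, 1000, 10000, 100000]

# Seconds per single state update
for n, c, g in zip(iters, cpu, gpu):
    print(f"{n:>6} iters: cpu {c / n:.2e} s/iter, gpu {g / n:.2e} s/iter")
```

By my reading, the CPU settles around 45 µs per update while the GPU climbs from roughly 19 µs to 70 µs per update, which is the systematic behaviour I mean.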
I’d really appreciate some help on this. Thanks in advance.
The test script is the following:
import numpy as np
import torch as tr
import math
import time

GPUID = 0
tr.cuda.set_device(GPUID)

N = 1000
J = tr.randn(N, N)
x = tr.randn(N)
r = tr.randn(N)
y = tr.randn(N)

Jn = J.numpy()
xn = x.numpy()
rn = r.numpy()
yn = y.numpy()

cputimes = []
for sampl in (100, 1000, 10000, 100000):
    start = time.time()
    for i in range(sampl):
        rn = np.tanh(xn)
        xn = Jn.dot(xn)
    end = time.time()
    cputimes.append(end - start)
print(cputimes)

Jc = J.cuda()
xc = x.cuda()
rc = r.cuda()
yc = y.cuda()

gputimes = []
for sampl in (100, 1000, 10000, 100000):
    start = time.time()
    for i in range(sampl):
        rc = tr.tanh(xc)
        xc = Jc.mv(xc)
    end = time.time()
    gputimes.append(end - start)
print(gputimes)
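One thing I'm not sure about is whether time.time() even captures the GPU work correctly, since CUDA kernel launches are asynchronous. A variant of the GPU loop with explicit torch.cuda.synchronize() calls would look like this (it falls back to CPU when no GPU is present, so the snippet runs anywhere; only the synchronize calls are new):

```python
import time
import torch as tr

N = 1000
J = tr.randn(N, N)  # recurrent weight matrix
x = tr.randn(N)     # internal state

# Use the GPU if available, otherwise fall back to CPU
device = "cuda" if tr.cuda.is_available() else "cpu"
Jd = J.to(device)
xd = x.to(device)

times = []
for sampl in (100, 1000):
    if device == "cuda":
        tr.cuda.synchronize()  # drain any pending kernels before starting the clock
    start = time.time()
    for i in range(sampl):
        rd = tr.tanh(xd)
        xd = Jd.mv(xd)
    if device == "cuda":
        tr.cuda.synchronize()  # wait for all launched kernels before stopping the clock
    end = time.time()
    times.append(end - start)
print(times)
```

If the synchronized numbers look like the ones above, the slowdown is real and not just a timing artifact.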