I wanted to write an RNN from scratch using PyTorch's CUDA capabilities, and I ran some preliminary tests to compare CPU vs. GPU speed. The task is very simple: a for loop mimicking the update of the internal state x of an RNN with recurrent weight matrix J. I'm using a Quadro K620 with CUDA 8.0.
With a state size of N = 1000 there seems to be a trade-off: the GPU implementation consistently gets slower as the number of iterations increases (I ran some other tests with different sizes of the J matrix and this behaviour seems pretty systematic).
These are typical running times I get from the enclosed script:
cpu: [0.010117292404174805, 0.058980703353881836, 0.45785975456237793, 4.512230634689331]
gpu: [0.0019445419311523438, 0.05474495887756348, 0.7503962516784668, 7.011191129684448]
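To make the trade-off easier to see, here are the same numbers divided by the iteration count (my own arithmetic on the output above, nothing new measured):

```python
# Timings from the run above (seconds for the whole loop)
cpu = [0.010117292404174805, 0.058980703353881836, 0.45785975456237793, 4.512230634689331]
gpu = [0.0019445419311523438, 0.05474495887756348, 0.7503962516784668, 7.011191129684448]
iters = [100, 1000, 10000, 100000]

# Seconds per single state update
for n, c, g in zip(iters, cpu, gpu):
    print(f"{n:>6} iters: cpu {c / n:.2e} s/iter, gpu {g / n:.2e} s/iter")
```

By my reading, the CPU settles around 45 µs per update while the GPU climbs from roughly 19 µs to 70 µs per update, which is the systematic behaviour I mean.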
I’d really appreciate some help on this. Thanks in advance.
The test script is the following:
import numpy as np
import torch as tr
import math
import time

GPUID = 0
tr.cuda.set_device(GPUID)

N = 1000
J = tr.randn(N, N)
x = tr.randn(N)
r = tr.randn(N)
y = tr.randn(N)

Jn = J.numpy()
xn = x.numpy()
rn = r.numpy()
yn = y.numpy()

cputimes = []
for sampl in (100, 1000, 10000, 100000):
    start = time.time()
    for i in range(sampl):
        rn = np.tanh(xn)
        xn = Jn.dot(xn)
    end = time.time()
    cputimes.append(end - start)
print(cputimes)

Jc = J.cuda()
xc = x.cuda()
rc = r.cuda()
yc = y.cuda()

gputimes = []
for sampl in (100, 1000, 10000, 100000):
    start = time.time()
    for i in range(sampl):
        rc = tr.tanh(xc)
        xc = Jc.mv(xc)
    end = time.time()
    gputimes.append(end - start)
print(gputimes)
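One thing I'm not sure about is whether time.time() even captures the GPU work correctly, since CUDA kernel launches are asynchronous. A variant of the GPU loop with explicit torch.cuda.synchronize() calls would look like this (it falls back to CPU when no GPU is present, so the snippet runs anywhere; only the synchronize calls are new):

```python
import time
import torch as tr

N = 1000
J = tr.randn(N, N)  # recurrent weight matrix
x = tr.randn(N)     # internal state

# Use the GPU if available, otherwise fall back to CPU
device = "cuda" if tr.cuda.is_available() else "cpu"
Jd = J.to(device)
xd = x.to(device)

times = []
for sampl in (100, 1000):
    if device == "cuda":
        tr.cuda.synchronize()  # drain any pending kernels before starting the clock
    start = time.time()
    for i in range(sampl):
        rd = tr.tanh(xd)
        xd = Jd.mv(xd)
    if device == "cuda":
        tr.cuda.synchronize()  # wait for all launched kernels before stopping the clock
    end = time.time()
    times.append(end - start)
print(times)
```

If the synchronized numbers look like the ones above, the slowdown is real and not just a timing artifact.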