GPU slower than CPU on a simple RNN test code

Hi,

I wanted to write an RNN from scratch using PyTorch’s CUDA capabilities, and I ran some preliminary tests to compare CPU vs. GPU speed. The task is very simple: a for loop mimicking the update of the internal state x of an RNN with recurrent weight matrix J. I’m using a Quadro K620 with CUDA 8.0.

With x of size N = 1000 there seems to be a crossover, with the GPU implementation consistently getting slower as the number of iterations increases (I ran some other tests with different sizes of the J matrix and this behaviour seems pretty systematic).

Here is an example of the running times (in seconds) I get from the enclosed script:

cpu: [0.010117292404174805, 0.058980703353881836, 0.45785975456237793, 4.512230634689331]
gpu: [0.0019445419311523438, 0.05474495887756348, 0.7503962516784668, 7.011191129684448]

I’d really appreciate some help on this. Thanks in advance.

The test script is the following:

import numpy as np
import torch as tr
import time

GPUID = 0
tr.cuda.set_device(GPUID)

N = 1000

J = tr.randn(N, N)   # recurrent weight matrix
x = tr.randn(N)      # internal state
r = tr.randn(N)      # rates (nonlinearity output)

Jn = J.numpy()
xn = x.numpy()
rn = r.numpy()

cputimes = []
for sampl in (100, 1000, 10000, 100000):
    start = time.time()
    for i in range(sampl):
        rn = np.tanh(xn)   # nonlinearity
        xn = Jn.dot(xn)    # state update
    end = time.time()
    cputimes.append(end - start)
print(cputimes)

Jc = J.cuda()
xc = x.cuda()
rc = r.cuda()

gputimes = []
for sampl in (100, 1000, 10000, 100000):
    start = time.time()
    for i in range(sampl):
        rc = tr.tanh(xc)   # nonlinearity
        xc = Jc.mv(xc)     # state update
    end = time.time()
    gputimes.append(end - start)
print(gputimes)
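
One thing I’m not sure about: CUDA launches are asynchronous, so time.time() can return before the queued kernels have actually finished. Here is a variant of the GPU loop that calls tr.cuda.synchronize() (which blocks until all pending GPU work is done) before reading the clock:

gputimes_sync = []
for sampl in (100, 1000, 10000, 100000):
    tr.cuda.synchronize()    # drain any previously queued work
    start = time.time()
    for i in range(sampl):
        rc = tr.tanh(xc)
        xc = Jc.mv(xc)
    tr.cuda.synchronize()    # wait for the last kernel before stopping the clock
    gputimes_sync.append(time.time() - start)
print(gputimes_sync)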

I experienced the same thing: a very simple network runs faster on the CPU than on the GPU.

Can anyone help? I’m experiencing the same thing.

It’s totally normal for RNNs like this to run faster on CPUs than on GPUs: they issue lots of teeny-tiny kernel launches, and the GPU cores spend most of their time waiting for the next batch of work to arrive…
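
If you want to see that overhead directly, here’s a rough sketch (the sizes are made up): time many element-wise kernels on a tiny tensor, so the arithmetic itself is negligible and the per-iteration cost is essentially the launch latency.

import time
import torch as tr

x = tr.randn(10).cuda()   # tiny tensor: the math itself is negligible
iters = 10000

tr.cuda.synchronize()
start = time.time()
for _ in range(iters):
    x = tr.tanh(x)        # each call is a separate kernel launch
tr.cuda.synchronize()

# Dominated by launch/dispatch overhead, typically microseconds per kernel.
print((time.time() - start) / iters)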

By the way, you’ll get faster per-example times by increasing the batch size, though the effective learning rate will probably decrease as a result.
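
A minimal sketch of that idea, assuming B independent sequences whose states are stacked as columns of an N×B matrix (B = 64 is a made-up batch size): each step then launches one matrix-matrix kernel instead of B separate matrix-vector kernels.

import torch as tr

N, B = 1000, 64             # B is a hypothetical batch size
J = tr.randn(N, N).cuda()
X = tr.randn(N, B).cuda()   # one state vector per column

for _ in range(1000):
    R = tr.tanh(X)          # rates for the whole batch at once
    X = J.mm(X)             # one mm replaces B separate mv calls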