Bypass autograd / cuda queue overhead

Hello,

I would like to use PyTorch to accelerate some compute tasks that do not involve neural nets.

I essentially compute some data over a fixed grid many thousands of times with different weights, optimizing the weights.

My array sizes are 128x128, and my functions involve mostly sin / cos, raising to a power, etc.

If I use numpy or CPU tensors, the performance is quite similar. With GPU tensors, I only attain somewhat better performance, and only when caching lots of intermediates.

I think this is because there is overhead in dispatching work to the GPU, or because the event loop driving the CUDA queue is not granular enough?

Some setup:

import numpy as np
import torch
from math import sin, cos  # needed inside the numba-vectorized kernels below
from numba import vectorize

def thmeshgrid(x, y):
    # torch equivalent of np.meshgrid(x, y, indexing='ij'):
    # xx[i, j] = x[i], yy[i, j] = y[j]
    xx = x.view(-1, 1).repeat(1, y.shape[0])
    yy = y.repeat(x.shape[0], 1)
    return xx, yy

# Zernike-style polynomial phase terms in polar coordinates (rho, phi)
def thZ45(rho, phi):
    return (210 * rho**10 - 504 * rho**8 + 420 * rho**6 - 140 * rho**4 + 15 * rho**2) \
        * torch.sin(2 * phi)

def thZ46(rho, phi):
    return (462 * rho**11 - 1260 * rho**9 + 1260 * rho**7 - 560 * rho**5 + 105 * rho**3 - 6 * rho) \
        * torch.cos(phi)

def thZ47(rho, phi):
    return (462 * rho**11 - 1260 * rho**9 + 1260 * rho**7 - 560 * rho**5 + 105 * rho**3 - 6 * rho) \
        * torch.sin(phi)

def thZ48(rho, phi):
    return 924 * rho**12 \
        - 2772 * rho**10 \
        + 3150 * rho**8 \
        - 1680 * rho**6 \
        + 420 * rho**4 \
        - 42 * rho**2 \
        + 1

@vectorize
def Z45(rho, phi):
    return (210 * rho**10 - 504 * rho**8 + 420 * rho**6 - 140 * rho**4 + 15 * rho**2) \
        * sin(2 * phi)

@vectorize
def Z46(rho, phi):
    return (462 * rho**11 - 1260 * rho**9 + 1260 * rho**7 - 560 * rho**5 + 105 * rho**3 - 6 * rho) \
        * cos(phi)

@vectorize
def Z47(rho, phi):
    return (462 * rho**11 - 1260 * rho**9 + 1260 * rho**7 - 560 * rho**5 + 105 * rho**3 - 6 * rho) \
        * sin(phi)

@vectorize
def Z48(rho, phi):
    return 924 * rho**12 \
        - 2772 * rho**10 \
        + 3150 * rho**8 \
        - 1680 * rho**6 \
        + 420 * rho**4 \
        - 42 * rho**2 \
        + 1

@vectorize is just used to fuse the power and trig operations into a single compiled pass over each array.
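As an aside, if the problem turns out to be per-op launch overhead, the same fusion trick should work on the GPU: numba can compile CUDA ufuncs as well. A minimal sketch, assuming a CUDA-capable numba install (the cuda target requires an explicit type signature, unlike the CPU default):

# Sketch: fuse one whole Z term into a single CUDA kernel via numba.
from math import sin
from numba import vectorize

@vectorize(['float64(float64, float64)'], target='cuda')
def Z45_cuda(rho, phi):
    return (210 * rho**10 - 504 * rho**8 + 420 * rho**6
            - 140 * rho**4 + 15 * rho**2) * sin(2 * phi)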

Now the algorithm in numpy:

%%timeit
x = np.linspace(-1, 1, 128)
y = np.linspace(-1, 1, 128)
xx, yy = np.meshgrid(x, y)
rho, phi = np.sqrt(xx**2 + yy**2), np.arctan2(yy, xx)
phase = Z45(rho, phi) + Z46(rho, phi) + Z47(rho, phi) + Z48(rho, phi)
err = np.exp(2 * np.pi * phase / 1e6)
2.37 ms ± 276 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

th-cuda:

%%timeit
x = torch.linspace(-1, 1, 128).cuda()
y = torch.linspace(-1, 1, 128).cuda()
xx, yy = thmeshgrid(x, y)
rho, phi = torch.sqrt(xx**2 + yy**2), torch.atan2(yy, xx)
phase = thZ45(rho, phi) + thZ46(rho, phi) + thZ47(rho, phi) + thZ48(rho, phi)
err = torch.exp(2 * np.pi * phase / 1e6)
3.41 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
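(CUDA launches are asynchronous, so %%timeit over the whole cell can blur where the time actually goes. A sketch of a synchronized measurement, using torch.cuda.synchronize() to block until the queue has drained:)

import time

# Sketch: bracket the work with explicit synchronization so the clock
# covers the kernels themselves, not just the time to enqueue them.
x = torch.linspace(-1, 1, 128).cuda()
y = torch.linspace(-1, 1, 128).cuda()
xx, yy = thmeshgrid(x, y)
rho, phi = torch.sqrt(xx**2 + yy**2), torch.atan2(yy, xx)

torch.cuda.synchronize()
start = time.perf_counter()
phase = thZ45(rho, phi) + thZ46(rho, phi) + thZ47(rho, phi) + thZ48(rho, phi)
err = torch.exp(2 * np.pi * phase / 1e6)
torch.cuda.synchronize()
print(time.perf_counter() - start)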

These timings are on battery as I write this, but even on AC power the GPU fails to match the CPU. In a highly optimized version, with the values of thZ45 etc. cached so that each iteration only sums them and passes the result to torch.exp (a sketch of that version is below), the torch implementation takes 192 µs while numpy takes 557 µs. I would expect a greater benefit from running these functions on CUDA, and it seems there is ~20 µs of overhead associated with each call involving cuda tensors. Can this be bypassed, or is it e.g. the latency of putting work on the GPU's queue?
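(For concreteness, the cached version looks roughly like this; the Z terms depend only on the fixed grid, so they are computed once up front and only the sum and exp are timed:)

# Sketch of the cached variant: precompute the grid-dependent Z terms,
# then time only the per-iteration sum and exponential.
x = torch.linspace(-1, 1, 128).cuda()
y = torch.linspace(-1, 1, 128).cuda()
xx, yy = thmeshgrid(x, y)
rho, phi = torch.sqrt(xx**2 + yy**2), torch.atan2(yy, xx)
z45, z46 = thZ45(rho, phi), thZ46(rho, phi)
z47, z48 = thZ47(rho, phi), thZ48(rho, phi)

%timeit torch.exp(2 * np.pi * (z45 + z46 + z47 + z48) / 1e6)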

I’m doing this on a Dell XPS 15 9560 with 32GB of RAM and a GTX 1050 GPU. Perhaps coordinating the GPU would be faster on a proper desktop PC?

This is an unfair comparison: here you are building things on the CPU and then transferring them to the GPU. A better comparison is to measure the computation time after the tensors have been constructed.
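(Something like this, as a sketch: construct the tensors and move them to the device first, then time only the arithmetic:)

# Sketch: keep construction and the host-to-device copy out of the timed
# region, so the benchmark measures only the GPU arithmetic.
x = torch.linspace(-1, 1, 128).cuda()
y = torch.linspace(-1, 1, 128).cuda()
xx, yy = thmeshgrid(x, y)
rho, phi = torch.sqrt(xx**2 + yy**2), torch.atan2(yy, xx)

%timeit thZ45(rho, phi) + thZ46(rho, phi) + thZ47(rho, phi) + thZ48(rho, phi)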

Also, assuming that you are using a release build, you are running on plain tensors, which don't have autograd overhead.
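(In current PyTorch releases, where Tensor and Variable are merged, you can also rule autograd out explicitly; a sketch:)

# Sketch: disable gradient tracking entirely, so any remaining overhead
# cannot be coming from autograd bookkeeping.
# (rho, phi as constructed in the earlier snippets, already on the GPU)
with torch.no_grad():
    phase = thZ45(rho, phi) + thZ46(rho, phi) + thZ47(rho, phi) + thZ48(rho, phi)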

I don’t know where the overhead comes from.

There is a flat cost of ~20 µs for any operation on a cuda tensor, vs an unmeasurably small time on the CPU.

E.g. use

x = torch.linspace(-1, 1, 128)

%timeit torch.sin(x)

For me, that is ~2 µs.

repeat with

x = x.cuda()
%timeit torch.sin(x)

=> ~21 µs.

So the overhead isn't the cost of putting x and y on the GPU.
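(One way to convince yourself this is fixed launch latency rather than compute time: run the same op over very different sizes; a rough sketch, sizes picked arbitrarily:)

import time

# Sketch: if the ~20 µs is fixed launch latency, per-call time should
# barely change as the array grows, until the kernel itself dominates.
for n in (128, 128 * 128, 1024 * 1024):
    x = torch.linspace(-1, 1, n).cuda()
    torch.cuda.synchronize()  # make sure setup has finished
    start = time.perf_counter()
    for _ in range(1000):
        torch.sin(x)
    torch.cuda.synchronize()  # wait for all queued kernels
    print(n, (time.perf_counter() - start) / 1000)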