Hello,

I would like to use PyTorch to accelerate some compute tasks that do not involve neural nets.

I essentially compute some data over a fixed grid many thousands of times with different weights, optimizing the weights.

My array sizes are 128x128, and my functions involve mostly sin / cos, raising to a power, etc.

With NumPy or CPU tensors, performance is about the same. With GPU tensors, I get only slightly better performance unless I cache a lot of intermediates.

I think this is because there is overhead in instructing the GPU, or perhaps the event loop driving the CUDA queue is not granular enough?

Some setup:

```
from math import cos, sin  # scalar trig functions used inside the numba kernels

import numpy as np
import torch
from numba import vectorize

def thmeshgrid(x, y):
    # Build 2D coordinate grids from two 1D tensors.
    xx = x.view(-1, 1).repeat(1, y.shape[0])
    yy = y.repeat(x.shape[0], 1)
    return xx, yy

# torch versions: each arithmetic op runs as a separate kernel
def thZ45(rho, phi):
    return (210 * rho**10 - 504 * rho**8 + 420 * rho**6 - 140 * rho**4 + 15 * rho**2) \
        * torch.sin(2 * phi)

def thZ46(rho, phi):
    return (462 * rho**11 - 1260 * rho**9 + 1260 * rho**7 - 560 * rho**5 + 105 * rho**3 - 6 * rho) \
        * torch.cos(phi)

def thZ47(rho, phi):
    return (462 * rho**11 - 1260 * rho**9 + 1260 * rho**7 - 560 * rho**5 + 105 * rho**3 - 6 * rho) \
        * torch.sin(phi)

def thZ48(rho, phi):
    return (924 * rho**12
            - 2772 * rho**10
            + 3150 * rho**8
            - 1680 * rho**6
            + 420 * rho**4
            - 42 * rho**2
            + 1)

# numba versions: each function compiles to one elementwise kernel
@vectorize
def Z45(rho, phi):
    return (210 * rho**10 - 504 * rho**8 + 420 * rho**6 - 140 * rho**4 + 15 * rho**2) \
        * sin(2 * phi)

@vectorize
def Z46(rho, phi):
    return (462 * rho**11 - 1260 * rho**9 + 1260 * rho**7 - 560 * rho**5 + 105 * rho**3 - 6 * rho) \
        * cos(phi)

@vectorize
def Z47(rho, phi):
    return (462 * rho**11 - 1260 * rho**9 + 1260 * rho**7 - 560 * rho**5 + 105 * rho**3 - 6 * rho) \
        * sin(phi)

@vectorize
def Z48(rho, phi):
    return (924 * rho**12
            - 2772 * rho**10
            + 3150 * rho**8
            - 1680 * rho**6
            + 420 * rho**4
            - 42 * rho**2
            + 1)
```

`vectorize` is used only to fuse the trig and power kernels, reducing each function to a single pass over the array.

Now the algorithm in NumPy:

```
%%timeit
x = np.linspace(-1, 1, 128)
y = np.linspace(-1, 1, 128)
xx, yy = np.meshgrid(x, y)
rho, phi = np.sqrt(xx**2 + yy**2), np.arctan2(yy, xx)
phase = Z45(rho, phi) + Z46(rho, phi) + Z47(rho, phi) + Z48(rho, phi)
err = np.exp(2 * np.pi * phase / 1e6)
2.37 ms ± 276 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

PyTorch on CUDA:

```
%%timeit
x = torch.linspace(-1, 1, 128).cuda()
y = torch.linspace(-1, 1, 128).cuda()
xx, yy = thmeshgrid(x, y)
rho, phi = torch.sqrt(xx**2 + yy**2), torch.atan2(yy, xx)
phase = thZ45(rho, phi) + thZ46(rho, phi) + thZ47(rho, phi) + thZ48(rho, phi)
err = torch.exp(2 * np.pi * phase / 1e6)
3.41 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

These timings were taken on battery as I write this, but even on AC power the GPU fails to match the CPU. In a highly optimized version, where the values of thZ45, etc. are cached and I simply sum them and pass the result to torch.exp, the torch implementation takes 192 µs while NumPy takes 557 µs. I would expect a greater benefit from running these functions on CUDA, and there seems to be about 20 µs of overhead associated with each call involving CUDA tensors. Can this be bypassed, or is it, e.g., the latency of putting something on the GPU’s work queue?
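One way to probe the per-call overhead is to time the same operation on a 128x128 tensor and on a one-element tensor. If the two take about the same time, the cost is mostly fixed per-call overhead rather than arithmetic. The sketch below is only an assumption about how one might measure this: `time_op` is a hypothetical helper, and the iteration counts are arbitrary. It calls `torch.cuda.synchronize()` around the timed loop because CUDA calls return before the GPU has finished, and it falls back to the CPU when no GPU is available.

```python
import time
import torch

def time_op(fn, device, n_iter=100, n_warmup=10):
    """Mean wall-clock seconds per call of fn(), waiting for the GPU if one is used."""
    for _ in range(n_warmup):          # warm up: first CUDA calls pay one-off setup costs
        fn()
    if device.type == "cuda":
        torch.cuda.synchronize()       # make sure the warm-up work has finished
    start = time.perf_counter()
    for _ in range(n_iter):
        fn()
    if device.type == "cuda":
        torch.cuda.synchronize()       # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / n_iter

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.linspace(-1, 1, 128, device=device)
grid = x.view(-1, 1).repeat(1, 128)    # 128x128, the array size used in this post
tiny = torch.ones(1, device=device)    # a single element: almost no arithmetic

t_grid = time_op(lambda: torch.sin(grid), device)
t_tiny = time_op(lambda: torch.sin(tiny), device)
print(f"128x128: {t_grid * 1e6:.1f} us per call, 1 element: {t_tiny * 1e6:.1f} us per call")
```

I have not included numbers here, since they depend entirely on the machine.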

I’m doing this on a Dell XPS 15 9560 with 32GB of RAM and a GTX 1050 GPU. Perhaps coordinating the GPU would be faster on a proper desktop PC?