Hello,

I have been trying to use PyTorch to speed up some simple, embarrassingly parallel computations, with little success. I am looking for guidance on how to speed up the simple code below.

The following functions create the data used in the example further below:

```
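import numpy
import math
import torch
import pandas
import timeit
from timeit import default_timer as timer


def assetPathsCPU(S0, mu, sigma, T, nRows, nPaths):
    # per-step drift and volatility of the log-price
    dt = T / nRows
    nudt = (mu - 0.5 * sigma**2) * dt
    sidt = sigma * math.sqrt(dt)
    # Gaussian log-price increments, one column per path
    increments = nudt + sidt * numpy.random.randn(int(nRows), int(nPaths))
    # prepend log(S0), then cumulate and exponentiate to get prices
    x = numpy.concatenate((math.log(S0) * numpy.ones((1, int(nPaths))), increments))
    pricePaths = numpy.exp(numpy.cumsum(x, axis=0))
    return pricePaths


def assetPathsGPU(S0, mu, sigma, T, nRows, nPaths, dtype, device):
    # S0 and dt must be torch tensors here:
    # torch.log and torch.sqrt do not accept Python floats
    dt = T / nRows
    nudt = (mu - 0.5 * sigma**2) * dt
    sidt = sigma * torch.sqrt(dt)
    # use the `device` argument rather than the global `cuda0`
    logS0 = torch.log(S0) * torch.ones(1, nPaths, dtype=dtype, device=device)
    increments = torch.distributions.Normal(nudt, sidt).sample((nRows, nPaths)).squeeze()
    # prepend log(S0), then cumulate and exponentiate to get prices
    pricePaths = torch.exp(torch.cumsum(torch.cat((logS0, increments), dim=0), dim=0))
    return pricePaths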
```

These are the EMA functions, one for the CPU and one for the GPU:

```
def emaNPathsCPU(pricePaths, lookback):
    # find T and nPaths
    T, nPaths = pricePaths.shape
    # create output array
    ema = numpy.zeros([int(T), int(nPaths)])
    # compute the smoothing constant
    a = 2.0 / (lookback + 1.0)
    # iterate over each price path
    for pathIndex in range(0, int(nPaths)):
        # seed the EMA with the first price of the path
        ema[0, pathIndex] = pricePaths[0, pathIndex]
        # iterate over each point in time and compute the EMA
        for t in range(1, T):
            ema[t, pathIndex] = a * (pricePaths[t, pathIndex] - ema[t-1, pathIndex]) + ema[t-1, pathIndex]
    return ema


def emaNPathsGPU(pricePaths, lookback, dtype, device):
    # find T and nPaths
    T, nPaths = pricePaths.shape
    # create output tensor on the target device
    ema = torch.zeros(T, nPaths, dtype=dtype, device=device)
    # compute the smoothing constant
    a = 2.0 / (lookback + 1.0)
    # seed the EMA of every path with its first price
    ema[0, :] = pricePaths[0, :]
    # iterate over each price path
    for pathIndex in range(nPaths):
        # iterate over each point in time and compute the EMA
        for t in range(1, T):
            ema[t, pathIndex] = a * (pricePaths[t, pathIndex] - ema[t-1, pathIndex]) + ema[t-1, pathIndex]
    return ema
```
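
Both functions are meant to implement the same recurrence. Written out, with $p_{t,j}$ denoting the price of path $j$ at time step $t$ (notation introduced here only for illustration):

```latex
a = \frac{2}{\text{lookback} + 1}, \qquad
\mathrm{ema}_{0,j} = p_{0,j}, \qquad
\mathrm{ema}_{t,j} = \mathrm{ema}_{t-1,j} + a\,\bigl(p_{t,j} - \mathrm{ema}_{t-1,j}\bigr),
\quad t = 1, \dots, T-1 .
```

Each step depends on the previous time step of the same path only, never on another path.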

Here is how I call them:

```
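cuda0 = torch.device('cuda:0')
dtype = torch.float64
lookbackGPU = torch.tensor(90.0, dtype=dtype, device=cuda0)
# wait for pending GPU work so it is not counted in the timing
torch.cuda.synchronize()
# start timer (EMA)
ts_emaGPU = timer()
# EMA on paths
emaGPU = emaNPathsGPU(pricePathsGPU[:, 0:1000], lookbackGPU, dtype, cuda0)
# CUDA kernels run asynchronously; wait for them to finish before stopping the timer
torch.cuda.synchronize()
# end timer (EMA)
te_emaGPU = timer()
# compute time elapsed
timeElapsed_emaGPU = te_emaGPU - ts_emaGPU
# display time elapsed
print('EMA Time Elapsed (GPU): ' + str(timeElapsed_emaGPU))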
```

```
# copy the GPU price paths to a NumPy array so both versions use the same data
pricePathsGPU_CPU = pricePathsGPU.cpu().numpy()
lookbackCPU = 90.0
# start timer (EMA)
ts_emaCPU = timer()
# EMA on paths
emaCPU = emaNPathsCPU(pricePathsGPU_CPU[:, 0:1000], lookbackCPU)
# end timer (EMA)
te_emaCPU = timer()
# compute time elapsed
timeElapsed_emaCPU = te_emaCPU - ts_emaCPU
# display time elapsed
print('EMA Time Elapsed (CPU): ' + str(timeElapsed_emaCPU))
```

The link to the example notebook is here:

Why is the GPU version so slow?

I was expecting the ‘path’ loop to run in parallel, because each path is completely independent of the others. GPU utilization on the Titan is about 25% while the code is running, yet the run is very slow.
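
To make that independence explicit, the inner loop over paths could be replaced by one row-wise update per time step, so that each step touches all paths at once. The following is only a minimal sketch under that assumption: `emaNPathsVectorized` is a hypothetical name that does not appear in the notebook, and the loop over time is kept because each step still depends on the previous one.

```python
import torch


def emaNPathsVectorized(pricePaths, lookback):
    # Hypothetical helper for illustration; not part of the notebook.
    # pricePaths: 2-D tensor of shape (T, nPaths), on the CPU or on the GPU.
    T, nPaths = pricePaths.shape
    # compute the smoothing constant
    a = 2.0 / (lookback + 1.0)
    # output tensor with the same shape, dtype and device as the input
    ema = torch.zeros_like(pricePaths)
    # seed the EMA of every path with its first price
    ema[0] = pricePaths[0]
    # the time loop stays sequential; each iteration updates all paths at once
    for t in range(1, T):
        ema[t] = a * (pricePaths[t] - ema[t - 1]) + ema[t - 1]
    return ema
```

With this layout the number of tensor operations issued from Python scales with `T - 1` rather than with `(T - 1) * nPaths`. Whether that is enough to beat the CPU version would still need to be measured.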

What can I do to make the PyTorch version faster?

Any help would be greatly appreciated.

Many thanks,

Derek