# Speeding Up Loops on GPU

Hello,

I have been trying to use PyTorch to speed up some simple embarrassingly parallel computations with little success. I am looking for some guidance as to how to speed up the following simple code. Any help would be very much appreciated.

The following functions create the data used in the example further below:

```python
import numpy
import math
import torch
import pandas
import timeit
from timeit import default_timer as timer

def assetPathsCPU(S0,mu,sigma,T,nRows,nPaths):
    # simulate geometric Brownian motion price paths with NumPy
    dt = T/nRows
    nudt = (mu-0.5*sigma**2)*dt
    sidt = sigma*math.sqrt(dt)
    increments = nudt + sidt*numpy.random.randn(int(nRows),int(nPaths))
    # first row holds log(S0); the cumulative sum then gives the log-prices
    x=numpy.concatenate((math.log(S0)*numpy.ones((1,int(nPaths))),increments))
    pricePaths=numpy.exp(numpy.cumsum(x,axis=0))

    return pricePaths

def assetPathsGPU(S0,mu,sigma,T,nRows,nPaths,dtype,device):
    # S0 and T must be tensors here: torch.log and torch.sqrt do not accept Python floats
    dt = T/nRows
    nudt = (mu-0.5*sigma**2)*dt
    sidt = sigma*torch.sqrt(dt)
    # first row holds log(S0); use the `device` argument rather than the global cuda0
    logS0 = torch.log(S0)*torch.ones(1,nPaths,dtype=dtype,device=device)
    increments = torch.distributions.Normal(nudt,sidt).sample((nRows, nPaths)).squeeze()
    pricePaths=torch.exp(torch.cumsum(torch.cat((logS0,increments), dim=0),dim=0))

    return pricePaths
```

These are the simple functions - one for the CPU and one for the GPU:

```python
def emaNPathsCPU(pricePaths,lookback):
    # find T and nPaths
    T,nPaths=pricePaths.shape
    # create output array
    ema=numpy.zeros([int(T),int(nPaths)])
    # compute the smoothing constant
    a = 2.0 / ( lookback + 1.0 )
    # iterate over each price path
    for pathIndex in range(0,int(nPaths)):
        # initialize the EMA with the first price of the path
        ema[0,pathIndex] = pricePaths[0,pathIndex]
        # iterate over each point in time and compute the EMA
        for t in range(1,T):
            ema[t,pathIndex]=a * (pricePaths[t,pathIndex]-ema[t-1,
                pathIndex]) + ema[t-1,pathIndex]
    return ema

def emaNPathsGPU(pricePaths,lookback,dtype,device):
    # find T and nPaths
    T,nPaths=pricePaths.shape
    # create output array
    ema=torch.zeros(T,nPaths,dtype=dtype,device=device)
    # compute the smoothing constant
    a = 2.0 / ( lookback + 1.0 )
    # initialize the EMA with the first price of every path
    ema[0,:] = pricePaths[0,:]
    # iterate over each price path
    for pathIndex in range(nPaths):
        # iterate over each point in time and compute the EMA
        for t in range(1,T):
            ema[t,pathIndex]=a * (pricePaths[t,pathIndex]-ema[t-1,
                pathIndex]) + ema[t-1,pathIndex]
    return ema
```

Here is how I call them:

```python
cuda0 = torch.device('cuda:0')
dtype=torch.float64
lookbackGPU=torch.tensor(90.0,dtype=dtype,device=cuda0)

# wait for pending GPU work so it is not counted in the timing
torch.cuda.synchronize()
# start timer (EMA)
ts_emaGPU = timer()
# EMA on paths
emaGPU=emaNPathsGPU(pricePathsGPU[:,0:1000],lookbackGPU,dtype,cuda0)
# CUDA kernels run asynchronously; wait for them to finish before stopping the timer
torch.cuda.synchronize()
# end timer (EMA)
te_emaGPU = timer()
# compute time elapsed
timeElapsed_emaGPU=te_emaGPU-ts_emaGPU
# display time elapsed
print('EMA Time Elapsed (GPU): '+str(timeElapsed_emaGPU))
```

```python
pricePathsGPU_CPU=pricePathsGPU.cpu().numpy()
lookbackCPU=90.0
# start timer (EMA)
ts_emaCPU = timer()
# EMA on paths
emaCPU=emaNPathsCPU(pricePathsGPU_CPU[:,0:1000],lookbackCPU)
# end timer (EMA)
te_emaCPU = timer()
# compute time elapsed
timeElapsed_emaCPU=te_emaCPU-ts_emaCPU
# display time elapsed
print('EMA Time Elapsed (CPU): '+str(timeElapsed_emaCPU))
```

The link to the example notebook is here:

Why is the GPU version so slow?

I was expecting the ‘path’ loop to run in parallel because the paths are completely independent of each other. GPU utilization on the Titan is about 25% while the code is running, yet it is very slow.

What can I do to make the PyTorch version faster?

Any help would be greatly appreciated.

Many thanks,
Derek

The parallelism offered by the GPU is primarily per-op parallelism: each tensor operation is spread across many threads, but a Python loop over individual elements still issues its operations one at a time. (There is something to be said about CUDA streams, but that probably isn’t what you want to do here.) The best way to speed this up is to vectorize over your Monte Carlo paths, possibly still looping serially over time.
As you appear to be looking into financial mathematics: I did something like that for the Hull-White model a long time ago (when we had `Variable`…).
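To make the suggestion concrete, here is a minimal sketch of what vectorizing over the paths could look like for the EMA in the question. The function name `ema_paths_vectorized` and the small random test tensor are my own illustration, not code from the notebook, and the device falls back to the CPU when no GPU is present. The loop over time stays, but each step updates a whole row of paths with one tensor operation.

```python
import torch

def ema_paths_vectorized(price_paths, lookback):
    """EMA along dim 0 (time) for all paths at once.

    price_paths has shape (T, nPaths). Only time is looped over;
    every step updates all paths with a single tensor operation.
    """
    a = 2.0 / (lookback + 1.0)
    ema = torch.empty_like(price_paths)
    ema[0] = price_paths[0]
    for t in range(1, price_paths.shape[0]):
        # one row-wide update instead of nPaths scalar updates
        ema[t] = ema[t - 1] + a * (price_paths[t] - ema[t - 1])
    return ema

# Assumption for illustration: small random positive "prices"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)
prices = torch.rand(50, 8, dtype=torch.float64, device=device) + 1.0
ema_fast = ema_paths_vectorized(prices, 90.0)

# Reference: the element-by-element recurrence from the question, on the CPU
prices_cpu = prices.cpu()
ref = prices_cpu.clone()
a = 2.0 / (90.0 + 1.0)
for p in range(ref.shape[1]):
    for t in range(1, ref.shape[0]):
        ref[t, p] = a * (prices_cpu[t, p] - ref[t - 1, p]) + ref[t - 1, p]

max_abs_diff = (ema_fast.cpu() - ref).abs().max().item()
print("max abs difference vs. loop version:", max_abs_diff)
```

This still launches one small batch of kernels per time step, so the time dimension remains serial; the gain comes from replacing the inner per-path loop with row-wide operations. When timing it on the GPU, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels run asynchronously.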

Best regards

Thomas