Speeding Up Loops on GPU

Hello,

I have been trying to use PyTorch to speed up some simple, embarrassingly parallel computations, with little success. I am looking for some guidance on how to speed up the following simple code. Any help would be very much appreciated.

The following functions create the data used in the simple example further below:

import numpy
import math
import torch
import pandas
import timeit
from timeit import default_timer as timer

def assetPathsCPU(S0,mu,sigma,T,nRows,nPaths):
    # geometric Brownian motion paths via the log-Euler scheme
    dt = T/nRows
    nudt = (mu-0.5*sigma**2)*dt
    sidt = sigma*math.sqrt(dt)
    # normally distributed log-return increments for every step and path
    increments = nudt + sidt*numpy.random.randn(int(nRows),int(nPaths))
    # prepend log(S0), cumulate the log-returns and exponentiate back to prices
    x=numpy.concatenate((math.log(S0)*numpy.ones((1,int(nPaths))),increments))
    pricePaths=numpy.exp(numpy.cumsum(x,axis=0))

    return pricePaths

def assetPathsGPU(S0,mu,sigma,T,nRows,nPaths,dtype,device):
    # same log-Euler scheme as the CPU version, built from torch ops
    # (S0, mu, sigma and T are expected to be tensors on the target device)
    dt = T/nRows
    nudt = (mu-0.5*sigma**2)*dt
    sidt = sigma*torch.sqrt(dt)
    # first row is log(S0), the remaining rows are normally distributed log-returns
    logS0 = torch.log(S0)*torch.ones(1,nPaths,dtype=dtype,device=device)
    increments = torch.distributions.Normal(nudt,sidt).sample((nRows,nPaths)).squeeze()
    pricePaths = torch.exp(torch.cumsum(torch.cat((logS0,increments),dim=0),dim=0))

    return pricePaths
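
For reference, the price paths used in the timing code further below come from these functions; the call looks roughly like this (the parameter values here are just placeholders, the actual ones are in the notebook):

cuda0 = torch.device('cuda:0')
dtype = torch.float64
# scalar inputs as tensors on the GPU so torch.sqrt/torch.log work inside assetPathsGPU
S0 = torch.tensor(100.0,dtype=dtype,device=cuda0)
mu = torch.tensor(0.05,dtype=dtype,device=cuda0)
sigma = torch.tensor(0.2,dtype=dtype,device=cuda0)
T = torch.tensor(1.0,dtype=dtype,device=cuda0)
nRows, nPaths = 2000, 10000
pricePathsGPU = assetPathsGPU(S0,mu,sigma,T,nRows,nPaths,dtype,cuda0)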

These are the simple EMA functions - one for the CPU and one for the GPU:

def emaNPathsCPU(pricePaths,lookback):
    # find T and nPaths
    T,nPaths=pricePaths.shape
    # create output array
    ema=numpy.zeros([int(T),int(nPaths)])
    # compute the smoothing constant
    a = 2.0 / ( lookback + 1.0 )
    # iterate over each price path
    for pathIndex in range(0,int(nPaths)):
        # seed the EMA with the first price of the path
        ema[0,pathIndex] = pricePaths[0,pathIndex]
        # iterate over each point in time and compute the EMA
        for t in range(1,T):
            ema[t,pathIndex]=a * (pricePaths[t,pathIndex]-ema[t-1,pathIndex]) + ema[t-1,pathIndex]
    return ema

def emaNPathsGPU(pricePaths,lookback,dtype,device):
    # find T and nPaths
    T,nPaths=pricePaths.shape
    # create output array on the GPU
    ema=torch.zeros(T,nPaths,dtype=dtype,device=device)
    # compute the smoothing constant
    a = 2.0 / ( lookback + 1.0 )
    # seed the EMA with the first price of every path
    ema[0,:] = pricePaths[0,:]
    # iterate over each price path
    for pathIndex in range(nPaths):
        # iterate over each point in time and compute the EMA
        for t in range(1,T):
            ema[t,pathIndex]=a * (pricePaths[t,pathIndex]-ema[t-1,pathIndex]) + ema[t-1,pathIndex]
    return ema

Here is how I call them:

cuda0 = torch.device('cuda:0')
dtype=torch.float64
lookbackGPU=torch.tensor(90.0,dtype=dtype,device=cuda0)

# start timer (EMA)
ts_emaGPU = timer()
# EMA on paths
emaGPU=emaNPathsGPU(pricePathsGPU[:,0:1000],lookbackGPU,dtype,cuda0)
# end timer (EMA)
te_emaGPU = timer()
# compute time elapsed
timeElapsed_emaGPU = te_emaGPU - ts_emaGPU
# display time elapsed
print('EMA Time Elapsed (GPU): ' + str(timeElapsed_emaGPU))
pricePathsGPU_CPU=pricePathsGPU.cpu().numpy()
lookbackCPU=90.0
# start timer (EMA)
ts_emaCPU = timer()
# EMA on paths
emaCPU=emaNPathsCPU(pricePathsGPU_CPU[:,0:1000],lookbackCPU)
# end timer (EMA)
te_emaCPU = timer()
# compute time elapsed
timeElapsed_emaCPU = te_emaCPU - ts_emaCPU
# display time elapsed
print('EMA Time Elapsed (CPU): ' + str(timeElapsed_emaCPU))
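
As a side note on the measurement: CUDA kernels launch asynchronously, so I believe the GPU timer should only be stopped after a synchronize. A minimal sketch of that pattern (not from the notebook):

ts_emaGPU = timer()
emaGPU = emaNPathsGPU(pricePathsGPU[:,0:1000],lookbackGPU,dtype,cuda0)
# wait for all queued GPU work to finish before reading the timer
torch.cuda.synchronize()
te_emaGPU = timer()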

The link to the example notebook is here:

Why is the GPU version so slow?

I was expecting the ‘path’ loop to run in parallel because each path is completely independent. Utilization on the Titan GPU is about 25% while it is running, but it is still very slow.

What can I do to make the PyTorch version faster?

Any help would be greatly appreciated.

Many thanks,
Derek

The parallelism offered by the GPU is primarily per-op parallelism (there is something to be said about CUDA streams, but that probably isn’t what you want here). The best way to speed this up is to vectorize over your Monte Carlo paths (possibly serializing over time).
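Concretely, that means keeping only the loop over time and updating all paths at once in each step, along these lines (a rough sketch, not tested against your notebook):

def emaNPathsGPUVec(pricePaths,lookback,dtype,device):
    # same recursion as emaNPathsGPU, but each time step updates every path in one tensor op
    T,nPaths=pricePaths.shape
    ema=torch.zeros(T,nPaths,dtype=dtype,device=device)
    a = 2.0 / ( lookback + 1.0 )
    ema[0,:] = pricePaths[0,:]
    for t in range(1,T):
        ema[t,:] = a * (pricePaths[t,:]-ema[t-1,:]) + ema[t-1,:]
    return ema

That way each kernel launch works on all paths at once, which is where the GPU actually gets its parallelism, instead of issuing a tiny kernel per element.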
As you appear to be looking into financial mathematics: I did something like that for the Hull-White model a long time ago (back when we still had Variable…).

Best regards

Thomas

Thank you Thomas for your response. The referenced notebook is helpful.

I was hoping that PyTorch would function in a way similar to Numba.

Numba is great, but you still have to think about the order of your computations.
What you would need is the “holy grail of all polyhedral optimizers”, something that rewrites the order of your entire computation for you. There are projects like TVM or TensorFlow MLIR that are pushing some of those boundaries, but so far, putting some thought into it yourself is the best way to make these things faster.

Best regards

Thomas