Speeding Up Loops on GPU

Hello,

I have been trying to use PyTorch to speed up some simple, embarrassingly parallel computations, with little success. I am looking for some guidance on how to speed up the following simple code. Any help would be very much appreciated.

The following functions create the data used in the simple example further below:

import numpy
import math
import torch
import pandas
import timeit
from timeit import default_timer as timer

def assetPathsCPU(S0,mu,sigma,T,nRows,nPaths):
    # geometric Brownian motion paths via the log-Euler scheme
    dt = T/nRows
    nudt = (mu-0.5*sigma**2)*dt
    sidt = sigma*math.sqrt(dt)
    # normally distributed log-return increments for every step and path
    increments = nudt + sidt*numpy.random.randn(int(nRows),int(nPaths))
    # prepend log(S0), cumulate the log-returns and exponentiate back to prices
    x=numpy.concatenate((math.log(S0)*numpy.ones((1,int(nPaths))),increments))
    pricePaths=numpy.exp(numpy.cumsum(x,axis=0))

    return pricePaths

def assetPathsGPU(S0,mu,sigma,T,nRows,nPaths,dtype,device):
    # same log-Euler scheme as the CPU version, built from torch ops
    # (S0, mu, sigma and T are expected to be tensors on the target device)
    dt = T/nRows
    nudt = (mu-0.5*sigma**2)*dt
    sidt = sigma*torch.sqrt(dt)
    # first row is log(S0), the remaining rows are normally distributed log-returns
    logS0 = torch.log(S0)*torch.ones(1,nPaths,dtype=dtype,device=device)
    increments = torch.distributions.Normal(nudt,sidt).sample((nRows,nPaths)).squeeze()
    pricePaths = torch.exp(torch.cumsum(torch.cat((logS0,increments),dim=0),dim=0))

    return pricePaths
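
For reference, the price paths used in the timing code further below come from these functions; the call looks roughly like this (the parameter values here are just placeholders, the actual ones are in the notebook):

cuda0 = torch.device('cuda:0')
dtype = torch.float64
# scalar inputs as tensors on the GPU so torch.sqrt/torch.log work inside assetPathsGPU
S0 = torch.tensor(100.0,dtype=dtype,device=cuda0)
mu = torch.tensor(0.05,dtype=dtype,device=cuda0)
sigma = torch.tensor(0.2,dtype=dtype,device=cuda0)
T = torch.tensor(1.0,dtype=dtype,device=cuda0)
nRows, nPaths = 2000, 10000
pricePathsGPU = assetPathsGPU(S0,mu,sigma,T,nRows,nPaths,dtype,cuda0)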

These are the simple EMA functions - one for the CPU and one for the GPU:

def emaNPathsCPU(pricePaths,lookback):
    # find T and nPaths
    T,nPaths=pricePaths.shape
    # create output array
    ema=numpy.zeros([int(T),int(nPaths)])
    # compute the smoothing constant
    a = 2.0 / ( lookback + 1.0 )
    # iterate over each price path
    for pathIndex in range(0,int(nPaths)):
        # seed the EMA with the first price of the path
        ema[0,pathIndex] = pricePaths[0,pathIndex]
        # iterate over each point in time and compute the EMA
        for t in range(1,T):
            ema[t,pathIndex]=a * (pricePaths[t,pathIndex]-ema[t-1,pathIndex]) + ema[t-1,pathIndex]
    return ema

def emaNPathsGPU(pricePaths,lookback,dtype,device):
    # find T and nPaths
    T,nPaths=pricePaths.shape
    # create output array on the GPU
    ema=torch.zeros(T,nPaths,dtype=dtype,device=device)
    # compute the smoothing constant
    a = 2.0 / ( lookback + 1.0 )
    # seed the EMA with the first price of every path
    ema[0,:] = pricePaths[0,:]
    # iterate over each price path
    for pathIndex in range(nPaths):
        # iterate over each point in time and compute the EMA
        for t in range(1,T):
            ema[t,pathIndex]=a * (pricePaths[t,pathIndex]-ema[t-1,pathIndex]) + ema[t-1,pathIndex]
    return ema

Here is how I call them:

cuda0 = torch.device('cuda:0')
dtype=torch.float64
lookbackGPU=torch.tensor(90.0,dtype=dtype,device=cuda0)

# start timer (EMA)
ts_emaGPU = timer()
# EMA on paths
emaGPU=emaNPathsGPU(pricePathsGPU[:,0:1000],lookbackGPU,dtype,cuda0)
# end timer (EMA)
te_emaGPU = timer()
# compute time elapsed
timeElapsed_emaGPU = te_emaGPU - ts_emaGPU
# display time elapsed
print('EMA Time Elapsed (GPU): ' + str(timeElapsed_emaGPU))
pricePathsGPU_CPU=pricePathsGPU.cpu().numpy()
lookbackCPU=90.0
# start timer (EMA)
ts_emaCPU = timer()
# EMA on paths
emaCPU=emaNPathsCPU(pricePathsGPU_CPU[:,0:1000],lookbackCPU)
# end timer (EMA)
te_emaCPU = timer()
# compute time elapsed
timeElapsed_emaCPU = te_emaCPU - ts_emaCPU
# display time elapsed
print('EMA Time Elapsed (CPU): ' + str(timeElapsed_emaCPU))
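
As a side note on the measurement: CUDA kernels launch asynchronously, so I believe the GPU timer should only be stopped after a synchronize. A minimal sketch of that pattern (not from the notebook):

ts_emaGPU = timer()
emaGPU = emaNPathsGPU(pricePathsGPU[:,0:1000],lookbackGPU,dtype,cuda0)
# wait for all queued GPU work to finish before reading the timer
torch.cuda.synchronize()
te_emaGPU = timer()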

The link to the example notebook is here:

Why is the GPU version so slow?

I was expecting the ‘path’ loop to run in parallel because each path is completely independent. Utilization on the Titan GPU is about 25% while it is running, but it is still very slow.

What can I do to make the PyTorch version faster?

Any help would be greatly appreciated.

Many thanks,
Derek

The parallelism offered by the GPU is primarily per-op parallelism (there is something to be said about CUDA streams, but that probably isn’t what you want here). The best way to speed this up is to vectorize over your Monte Carlo paths (possibly serializing over time).
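Concretely, that means keeping only the loop over time and updating all paths at once in each step, along these lines (a rough sketch, not tested against your notebook):

def emaNPathsGPUVec(pricePaths,lookback,dtype,device):
    # same recursion as emaNPathsGPU, but each time step updates every path in one tensor op
    T,nPaths=pricePaths.shape
    ema=torch.zeros(T,nPaths,dtype=dtype,device=device)
    a = 2.0 / ( lookback + 1.0 )
    ema[0,:] = pricePaths[0,:]
    for t in range(1,T):
        ema[t,:] = a * (pricePaths[t,:]-ema[t-1,:]) + ema[t-1,:]
    return ema

That way each kernel launch works on all paths at once, which is where the GPU actually gets its parallelism, instead of issuing a tiny kernel per element.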
As you appear to be looking into financial mathematics: I did something like that for the Hull-White model a long time ago (back when we still had Variable…).

Best regards

Thomas

Thank you Thomas for your response. The referenced notebook is helpful.

I was hoping that PyTorch would function in a way similar to Numba.

Numba is great, but you still have to think about the order of your computations.
What you would need is the “holy grail of all polyhedral optimizers”, something that rewrites the order of your entire computation for you. There are projects like TVM or TensorFlow MLIR that are pushing some of those boundaries, but so far, putting some thought into it yourself is the best way to make these things faster.

Best regards

Thomas