Multiple operations in parallel on single GPU


I’m working on a problem involving sensitivity analysis, and I’m hoping to use PyTorch and its in-built operations instead of coding everything from scratch in CUDA.

I have a small example (using a NN, as most people here are familiar with that) as follows, where the computations of dZ and dA are independent of those of Z and A.

import torch

def sensitive(d_inp, inp, param):
    Z = torch.matmul(inp, param.T)
    dZ = torch.matmul(d_inp, param.T)

    A = torch.tanh(Z)
    dA = torch.unsqueeze(1 - A**2, dim=1) * dZ  # tanh'(Z) = 1 - tanh(Z)**2

    return A, dA

I want to parallelise the code such that Z and dZ are computed in parallel, followed by the parallel evaluation of A and dA.

I was looking for a solution to this but couldn’t find anything. I hope someone can help me out here.


They might already be executed asynchronously without you noticing, since CUDA kernels are launched asynchronously by default.
Have a look at this note: CUDA semantics — PyTorch 1.10 documentation
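For explicit control, you can also place the two independent matmuls on separate CUDA streams so the GPU is free to overlap them. A minimal sketch of your example (`sensitive_streams` is just an illustrative name; it falls back to plain sequential execution on CPU, where streams don’t apply):

```python
import torch

def sensitive_streams(d_inp, inp, param):
    # Sketch: run the value and sensitivity matmuls on separate CUDA
    # streams so the GPU may overlap them.
    if inp.is_cuda:
        s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
        torch.cuda.synchronize()  # make sure inputs are ready before branching
        with torch.cuda.stream(s1):
            Z = torch.matmul(inp, param.T)
            A = torch.tanh(Z)
        with torch.cuda.stream(s2):
            dZ = torch.matmul(d_inp, param.T)
        torch.cuda.synchronize()  # dA needs results from both streams
    else:
        # CPU fallback: sequential execution, same results
        A = torch.tanh(torch.matmul(inp, param.T))
        dZ = torch.matmul(d_inp, param.T)
    dA = torch.unsqueeze(1 - A**2, dim=1) * dZ
    return A, dA
```

Note that streams only help if each matmul leaves compute units idle; a large matmul often saturates the GPU by itself, in which case the kernels serialize anyway.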

I split the two operations (value and gradient) into separate functions and timed them independently. I don’t think they are being executed asynchronously, because the sum of their independent runtimes matches the time of my earlier combined test.

Is there a way to be certain that the two operations are running in parallel?
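One way to check is to record a run with `torch.profiler` and inspect the timeline: if the two matmul kernels occupy the same time window on the GPU track, they overlapped. A minimal sketch (the sizes are made up; on a machine without a GPU it just profiles the CPU ops):

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
inp = torch.randn(512, 512, device=device)
d_inp = torch.randn(512, 512, device=device)
param = torch.randn(512, 512, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    Z = inp @ param.T
    dZ = d_inp @ param.T
    if device == "cuda":
        torch.cuda.synchronize()

# Export and open in chrome://tracing (or Perfetto) to see whether
# the two matmul kernels overlap on the GPU timeline.
prof.export_chrome_trace("trace.json")
```

Also note that timing CUDA code naively is misleading because of the asynchronous launches: without a `torch.cuda.synchronize()` before stopping the clock, you measure only the kernel launch, not the execution.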

I read a recent publication (arXiv pre-print) where they’ve done something similar and casually mentioned that they used operator overloading in PyTorch to let the computations run in parallel (they must be using built-in classes). I don’t understand how something like this can be done.

Sorry, that’s out of my knowledge :smiling_face_with_tear:.
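For what it’s worth, “operator overloading” in that context usually refers to forward-mode automatic differentiation, which PyTorch exposes through `torch.func.jvp` (or `functorch.jvp` in older versions). It evaluates the function and its directional derivative in a single pass, rather than as two separate launches. A sketch for the example above, assuming a single perturbation direction `d_inp`:

```python
import torch
from torch.func import jvp

def f(inp, param):
    # The forward computation whose sensitivity we want
    return torch.tanh(torch.matmul(inp, param.T))

inp = torch.randn(4, 3)
param = torch.randn(5, 3)
d_inp = torch.randn(4, 3)  # tangent: the direction in which inp is perturbed

# jvp returns f(inp, param) and the Jacobian-vector product along the
# tangents; the tangent for param is zero because only inp is perturbed.
A, dA = jvp(f, (inp, param), (d_inp, torch.zeros_like(param)))
```

The result matches the hand-derived rule from the first post, i.e. `dA == (1 - A**2) * (d_inp @ param.T)`, without having to code the derivative yourself.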