Is it possible to execute two modules in parallel in PyTorch?

Hello everyone.
Is it possible to run two modules in parallel in PyTorch?
Everything I have seen so far runs sequentially, but can I, for example, do something like this:

       input 
p1       |        p2
|********|***********
|                   |
|                   |
|********|***********
         |
         |
     output

We can implement this in PyTorch easily by first running the operations in path 1 (p1), then those in path 2 (p2), and then combining their results.
But is there a way to make p1 and p2 run in parallel, so that they execute faster and neither has to wait for the other to finish first?
Thanks in advance.

If you are running the code on a GPU and call p1 and p2 one after the other, these calls will be queued onto the device and executed asynchronously. Depending on the workload of p1, p2 might start while p1 is still executing.
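
To make the asynchronous queuing visible, one can compare when the Python calls return with when the work has actually finished. This is only a minimal sketch: the two `nn.Linear` branches, the tensor sizes, and the helper name `run_branches_timed` are illustrative assumptions, not taken from the question. On a machine without a GPU it falls back to the CPU, where the calls are synchronous and the two timings are nearly equal.

```python
import time

import torch
import torch.nn as nn


def run_branches_timed(p1, p2, x):
    """Call p1 then p2; report when the calls returned vs. when the work finished."""
    start = time.perf_counter()
    out1 = p1(x)  # on CUDA this enqueues kernels and returns to Python early
    out2 = p2(x)  # enqueued behind p1's kernels without waiting for them
    queued_s = time.perf_counter() - start
    if x.is_cuda:
        torch.cuda.synchronize()  # block until all queued kernels are done
    finished_s = time.perf_counter() - start
    return out1, out2, queued_s, finished_s


device = "cuda" if torch.cuda.is_available() else "cpu"
p1 = nn.Linear(256, 128).to(device)  # hypothetical branch 1
p2 = nn.Linear(256, 64).to(device)   # hypothetical branch 2
x = torch.randn(32, 256, device=device)

with torch.no_grad():
    out1, out2, queued_s, finished_s = run_branches_timed(p1, p2, x)
print(f"calls returned after {queued_s:.6f}s, work finished after {finished_s:.6f}s")
```

The first CUDA call also pays one-time initialization costs, so a warm-up call before timing gives more meaningful numbers.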
If you have multiple GPUs in your system (and don’t use data parallel), you could execute p1 and p2 on separate devices and concatenate the results back on a single device.
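
The multi-GPU variant could look roughly like the sketch below. The class name `TwoBranch`, the `nn.Linear` branches, and the sizes are assumptions made for illustration. The device selection falls back to a single GPU or the CPU so that the snippet runs anywhere, but only the two-GPU case actually places the branches on different devices.

```python
import torch
import torch.nn as nn


class TwoBranch(nn.Module):
    """Runs p1 and p2 on (possibly) different devices and concatenates the results."""

    def __init__(self, dev1, dev2):
        super().__init__()
        self.dev1, self.dev2 = torch.device(dev1), torch.device(dev2)
        self.p1 = nn.Linear(256, 128).to(self.dev1)  # hypothetical branch 1
        self.p2 = nn.Linear(256, 64).to(self.dev2)   # hypothetical branch 2

    def forward(self, x):
        # Each branch gets a copy of the input on its own device.
        out1 = self.p1(x.to(self.dev1, non_blocking=True))
        out2 = self.p2(x.to(self.dev2, non_blocking=True))
        # Bring p2's result back to dev1 and combine there.
        return torch.cat([out1, out2.to(self.dev1)], dim=1)


n_gpus = torch.cuda.device_count()
if n_gpus >= 2:
    dev1, dev2 = "cuda:0", "cuda:1"
elif n_gpus == 1:
    dev1 = dev2 = "cuda:0"
else:
    dev1 = dev2 = "cpu"

model = TwoBranch(dev1, dev2)
x = torch.randn(32, 256, device=dev1)
out = model(x)
print(out.shape, out.device)
```

Because kernel launches are asynchronous, the host can enqueue p2 on the second device while p1 is still running on the first, so no explicit threading is needed in this sketch.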

Hi @ptrblck, may I ask a follow-up question?

"If you are running the code on a GPU and call p1 and p2 one after the other, these calls will be queued onto the device and executed asynchronously. Depending on the workload of p1, p2 might start while p1 is still executing."

So on a single GPU, p1 and p2 could run in parallel depending on the workload, which is great since we don’t have to write any special logic to make that happen.

"If you have multiple GPUs in your system (and don’t use data parallel), you could execute p1 and p2 on separate devices and concatenate the results back on a single device."

But I’m not clear about the second part of your answer. Do you mean that with multiple GPUs we need to explicitly code the logic (e.g. moving p1 to device1 and p2 to device2) to make them run in parallel?

Yes, you could manually move the data as well as the model copies to multiple devices and execute them in parallel directly. While the first approach could also run workloads in parallel, whether it does depends on the actual workload, as mentioned before.
I wouldn’t expect to see overlapping kernel execution when e.g. cuBLAS matmul kernels are executed, as they tend to occupy all compute resources of the device, so other kernels would have to wait.
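
Separate CUDA streams are not mentioned in this thread, but they are one common way to give the device a chance to overlap independent kernels on a single GPU. The sketch below is written under that assumption; the helper name `run_on_streams`, the branches, and the sizes are hypothetical. Whether any overlap actually happens still depends on the workload, for the reason given above: a kernel that fills the device leaves no room for another one. The sketch is meant for inference only (it is used under `torch.no_grad()`), and it falls back to plain sequential calls on the CPU.

```python
import torch
import torch.nn as nn


def run_on_streams(p1, p2, x):
    """Launch p1 and p2 on separate CUDA streams; run sequentially on CPU."""
    if not x.is_cuda:
        return torch.cat([p1(x), p2(x)], dim=1)

    current = torch.cuda.current_stream()
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    # The side streams must not start before x is ready on the current stream.
    s1.wait_stream(current)
    s2.wait_stream(current)
    with torch.cuda.stream(s1):
        out1 = p1(x)
    with torch.cuda.stream(s2):
        out2 = p2(x)
    # Tell the caching allocator which other streams use these tensors.
    x.record_stream(s1)
    x.record_stream(s2)
    out1.record_stream(current)
    out2.record_stream(current)
    # The current stream waits for both branches before combining them.
    current.wait_stream(s1)
    current.wait_stream(s2)
    return torch.cat([out1, out2], dim=1)


device = "cuda" if torch.cuda.is_available() else "cpu"
p1 = nn.Linear(256, 128).to(device)  # hypothetical branch 1
p2 = nn.Linear(256, 64).to(device)   # hypothetical branch 2
x = torch.randn(32, 256, device=device)

with torch.no_grad():
    out = run_on_streams(p1, p2, x)
    reference = torch.cat([p1(x), p2(x)], dim=1)
print(out.shape)
```

A profiler trace is the reliable way to check whether the two branches really overlapped; identical results from both code paths only show that the synchronization is correct.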