How can I run two blocks in parallel?


I would like to run block1(x) and block2(x) in parallel on the same GPU. They are two independent blocks that take the same input signal x. Here is my forward function.
How can I run x1=self.block1(x) and x2=self.block2(x) in parallel rather than sequentially?

Thank you

def forward(self, x):
    return x


The CUDA API is asynchronous, so as long as you don’t print the values of the results, they will be pipelined automatically by the CUDA driver.
If you have a powerful enough GPU, they might already be running in parallel.


Hi @albanD, @DeepLearner17

The two modules block1 and block2 in the example, which produce x1 and x2, will run sequentially on the same CUDA stream.

Kernels launched on the same stream are not pipelined. All blocks from the first kernel must complete before any blocks from the second kernel can be issued. This is the most basic level of synchronization provided by CUDA.

In order to pipeline two modules, PyTorch would need to run the underlying CUDA kernels on different streams. Then blocks from the second module could begin issuing as soon as all blocks from the first module have issued.

There is a PyTorch API for this. Apparently it respects the stream assignments during the backwards pass, too. See
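As a rough illustration of the stream API, here is a minimal sketch. It assumes block1 and block2 are ordinary nn.Module instances (the two nn.Linear layers below are hypothetical stand-ins) and that each returns a single tensor. The helper name run_on_two_streams is made up for this example. On a machine without a GPU it simply runs the blocks one after the other.

```python
import torch
import torch.nn as nn


def run_on_two_streams(block1, block2, x):
    """Run block1(x) and block2(x) on separate CUDA streams when x is on a GPU."""
    if not x.is_cuda:
        # No GPU: nothing to overlap, so run the blocks sequentially.
        return block1(x), block2(x)

    current = torch.cuda.current_stream(x.device)
    s1 = torch.cuda.Stream(device=x.device)
    s2 = torch.cuda.Stream(device=x.device)

    # The side streams must not start before x has been produced.
    s1.wait_stream(current)
    s2.wait_stream(current)

    with torch.cuda.stream(s1):
        x1 = block1(x)
    with torch.cuda.stream(s2):
        x2 = block2(x)

    # Work queued later on the current stream waits for both branches.
    current.wait_stream(s1)
    current.wait_stream(s2)
    # Tell the caching allocator that x1 and x2 are now used on `current`.
    x1.record_stream(current)
    x2.record_stream(current)
    return x1, x2


# Hypothetical stand-ins for the two independent blocks.
device = "cuda" if torch.cuda.is_available() else "cpu"
block1 = nn.Linear(16, 8).to(device)
block2 = nn.Linear(16, 4).to(device)
x = torch.randn(32, 16, device=device)
x1, x2 = run_on_two_streams(block1, block2, x)
```

Whether the two branches actually overlap depends on how much of the GPU each kernel occupies. The streams only make overlap possible.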

One could imagine a simpler API: create a module called Parallel that takes a list of modules, runs each of them on a separate stream, and synchronizes at the end. It would look like this:

def __init__(self, x):
    self.branches = nn.Parallel(block1, block2)

def forward(self, x):
    x1, x2 = self.branches(x, x)
    x =, x2)
    return x

You could build large parallel branches by composing Parallel with Sequential.

I think this API might be simpler than using streams directly, because we usually don’t care which streams are actually used for different modules, we just want to indicate the parallelism in the model and have the framework exploit it automatically.
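To make the idea concrete, here is one way such a Parallel container could be sketched on top of the existing stream API. This is not part of torch.nn. The class is hypothetical, and it assumes that every branch returns a single tensor and that all inputs live on the current CUDA device. Combining the two outputs with torch.cat is likewise only an assumption for the usage example.

```python
import torch
import torch.nn as nn


class Parallel(nn.Module):
    """Hypothetical container that runs each branch on its own CUDA stream."""

    def __init__(self, *branches):
        super().__init__()
        self.branches = nn.ModuleList(branches)

    def forward(self, *inputs):
        if len(inputs) != len(self.branches):
            raise ValueError("expected one input per branch")
        if not all(t.is_cuda for t in inputs):
            # CPU fallback: run the branches one after the other.
            return tuple(b(t) for b, t in zip(self.branches, inputs))

        current = torch.cuda.current_stream()
        streams, outputs = [], []
        for branch, t in zip(self.branches, inputs):
            stream = torch.cuda.Stream()
            stream.wait_stream(current)  # inputs must be ready first
            with torch.cuda.stream(stream):
                outputs.append(branch(t))
            streams.append(stream)

        # Synchronize at the end: later work waits for every branch.
        for stream, out in zip(streams, outputs):
            current.wait_stream(stream)
            out.record_stream(current)
        return tuple(outputs)


# Composing Parallel with Sequential, as suggested above.
device = "cuda" if torch.cuda.is_available() else "cpu"
branches = Parallel(
    nn.Sequential(nn.Linear(16, 8), nn.ReLU()),
    nn.Linear(16, 4),
).to(device)
x = torch.randn(32, 16, device=device)
x1, x2 = branches(x, x)
merged = torch.cat((x1, x2), dim=1)  # assumed way of combining the outputs
```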

Anyway hope this helps somebody.


In practice, most of our kernels actually use the whole GPU, so kernels don’t run at the same time.
So we see benefits from using streams only in very niche cases, and general users wouldn’t need them.
We only provide the API for advanced users who know that their workload falls into this niche and who want to slightly improve performance.

Hi @albanD,

That’s not exactly how GPUs work. After a kernel issues all of its blocks, there is a “tail” of blocks that partially occupy the GPU. If there are just a few waves of blocks, then the tail can be a significant fraction of the total kernel execution time. Another kernel launched on a second stream could fill the device during the tail of the first kernel, increasing device utilization.
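A back-of-the-envelope sketch of the tail arithmetic, under the simplifying assumption that every block takes the same time and that the device can hold a fixed number of blocks at once. The function name and the numbers are made up for illustration.

```python
import math


def tail_stats(num_blocks, concurrent_blocks):
    """Estimate waves and utilization for one kernel.

    Assumes every block runs for the same time and that the device can hold
    `concurrent_blocks` blocks at once (SM count times blocks per SM).
    """
    waves = math.ceil(num_blocks / concurrent_blocks)
    # Blocks left for the final wave, which may only partially fill the GPU.
    last_wave_blocks = num_blocks - (waves - 1) * concurrent_blocks
    # Fraction of block slots that do useful work over the whole kernel.
    utilization = num_blocks / (waves * concurrent_blocks)
    # Share of the kernel's run time spent in a partially filled last wave.
    partial = last_wave_blocks < concurrent_blocks
    tail_time_fraction = 1 / waves if partial else 0.0
    return waves, last_wave_blocks, utilization, tail_time_fraction


# Hypothetical device that holds 160 blocks at a time.
few_waves = tail_stats(400, 160)
many_waves = tail_stats(16080, 160)
print(few_waves)
print(many_waves)
```

With these made-up numbers, the 400-block kernel needs 3 waves and the last one holds only 80 blocks, so a third of its run time leaves half the device idle. The 16080-block kernel has the same half-empty last wave, but it is 1 wave out of 101, so the effect is negligible.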

The issue for PyTorch would be that kernels with a very short execution time will hit the limit of how fast you can launch kernels from the CPU (let alone from a Python application).

CUDA Graphs were introduced to move kernel launch to the device side, reducing launch overhead for short running kernels.

Apparently there is an effort underway to use CUDA Graphs in PyTorch,

… but it seems to have hit a snag:

Anyway, as these issues are very interesting to me, I would like to learn more about what the roadmap looks like for this kind of functionality in PyTorch.

[Edit: fixed link to #15623]


cc @ngimel, who will know better what the plans are for such features.


cc @ptrblck, as the CUDA graphs effort is driven by NVIDIA, so he knows more about roadmaps. #21589 did not hit a snag; in fact, we believe that PyTorch is now stream-safe and that non-default streams can be used as needed. Some operations incur host-device synchronization and thus would break graph capture, but the majority should be fine. The biggest problem for enabling CUDA graphs the last time I was involved was the caching allocator: a CUDA graph wants kernel arguments (including data pointers) to remain the same between invocations, and with the caching allocator we cannot guarantee that.
As for the tail effect, streams would indeed help there, but it is still a pretty niche use case (you need a kernel that has just a few waves for the tail to be a noticeable part of execution time), which translates into a fairly narrow range of sizes where it would be observable.
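To illustrate the pointer-stability requirement: PyTorch releases newer than this discussion expose torch.cuda.CUDAGraph and the torch.cuda.graph context manager. The sketch below assumes such a release and an inference-only model. The helper make_replayable is hypothetical. The key point is that the captured graph always reads from and writes to the same static tensors, so new data has to be copied into them before each replay.

```python
import torch
import torch.nn as nn


def make_replayable(model, example_input, warmup_iters=3):
    """Capture model(example_input) once and return a function that replays it."""
    if not example_input.is_cuda:
        # CUDA Graphs need a GPU; fall back to a plain eager call.
        def run_eager(new_input):
            with torch.no_grad():
                return model(new_input)
        return run_eager

    # The graph records the addresses of these tensors, so they must persist.
    static_input = example_input.clone()

    # Warm up on a side stream so lazy initialization is not captured.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(warmup_iters):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = model(static_input)

    def run_graph(new_input):
        # Same pointers on every call: copy in, replay, copy out.
        static_input.copy_(new_input)
        graph.replay()
        return static_output.clone()

    return run_graph


device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 4).to(device).eval()
run = make_replayable(model, torch.randn(8, 16, device=device))
new_x = torch.randn(8, 16, device=device)
y = run(new_x)
with torch.no_grad():
    expected = model(new_x)
```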


The “snag” was meant to refer to “Static Graphs using CUDA 10 Graphs API #15623”, I updated my comment with the correct link.

I’m not sure that low achieved occupancy is so uncommon. A small model (e.g., MobileNetV2) on a big GPU will have just a few waves per kernel even for a moderate batch size. A small batch size is especially interesting for low-latency inference. It is also possible that choosing a small batch size could improve cache utilization.

I guess I need to learn more about the PyTorch plan for efficient inference on state of the art vision models, efficient training of small models on big GPUs, etc.