How can I run two blocks in parallel?

DeepLearner17 · November 20, 2019, 2:14pm

Hi,

I would like to run block1(x) and block2(x) in parallel on the same GPU. They are two independent blocks that take the same input signal x. Here is my forward function.
How can I run x1=self.block1(x) and x2=self.block2(x) in parallel rather than sequentially?

Thank you

def forward(self, x):
    x1 = self.block1(x)
    x2 = self.block2(x)
    x = torch.mm(x1, x2)
    return x

albanD · November 20, 2019, 2:35pm

Hi,

The CUDA API is asynchronous, so as long as you don’t print the values of the results, the kernels will be pipelined automatically by the CUDA driver.
If you have a powerful enough GPU, they might already be running in parallel.
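
One practical consequence of asynchronous launches is that timing code has to wait for the GPU before stopping the clock. The helper below is a minimal sketch of that idea; `time_ms` is a hypothetical name, the matrix size is arbitrary, and on a machine without CUDA it simply times the CPU call.

```python
import time

import torch


def time_ms(fn, iters=10):
    """Average wall-clock milliseconds per call of fn (hypothetical helper)."""
    use_cuda = torch.cuda.is_available()
    fn()  # warm-up call so one-time setup cost is not measured
    if use_cuda:
        torch.cuda.synchronize()  # drain work queued by the warm-up
    start = time.perf_counter()
    for _ in range(iters):
        fn()  # on CUDA this only enqueues kernels and returns immediately
    if use_cuda:
        torch.cuda.synchronize()  # wait for the queued kernels to finish
    return (time.perf_counter() - start) * 1000 / iters


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 256, device=device)
elapsed = time_ms(lambda: torch.mm(x, x))
print(f"torch.mm on {device}: {elapsed:.3f} ms per call")
```

Without the final `torch.cuda.synchronize()`, the loop on a GPU would mostly measure how quickly the CPU can enqueue kernels, not how long they take to run.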

andravin · February 8, 2020, 12:09am

Hi @albanD, @DeepLearner17

The two modules in the example, block1 and block2 (which produce x1 and x2), will run sequentially on the same CUDA stream.

Kernels launched on the same stream are not pipelined. All blocks from the first kernel must complete before any blocks from the second kernel can be issued. This is the most basic level of synchronization provided by CUDA.

In order to pipeline two modules, PyTorch would need to run the underlying CUDA kernels on different streams. Then blocks from the second module could begin issuing as soon as all blocks from the first module have issued.

There is a PyTorch API for this. Apparently it respects the stream assignments during the backwards pass, too. See https://github.com/pytorch/pytorch/pull/8354

One could imagine a simpler API: create a module called Parallel that takes a list of modules, runs each of them on a separate stream, and synchronizes at the end. It would look like this:

def __init__(self, x):
    ...
    self.branches = nn.Parallel(block1, block2)
    ...

def forward(self, x):
    x1, x2 = self.branches(x, x)
    x = torch.mm(x1, x2)
    return x

You could build large parallel branches by composing Parallel with Sequential.

I think this API might be simpler than using streams directly, because we usually don’t care which streams are actually used for different modules, we just want to indicate the parallelism in the model and have the framework exploit it automatically.
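
For anyone who wants to experiment, here is a rough sketch of how such a module might be written on top of the existing stream API. `Parallel` is hypothetical and not part of `torch.nn`; the sketch assumes all inputs live on the current CUDA device, creates fresh streams on every call for simplicity, and falls back to a plain sequential loop when the inputs are not on a GPU. Whether it is actually faster depends on the workload.

```python
import torch
import torch.nn as nn


class Parallel(nn.Module):
    """Hypothetical helper (not in torch.nn): one CUDA stream per branch."""

    def __init__(self, *branches):
        super().__init__()
        self.branches = nn.ModuleList(branches)

    def forward(self, *inputs):
        if len(inputs) != len(self.branches):
            raise ValueError("expected exactly one input per branch")
        # No CUDA tensors means no streams: run the branches one by one.
        if not all(x.is_cuda for x in inputs):
            return tuple(b(x) for b, x in zip(self.branches, inputs))

        current = torch.cuda.current_stream()
        streams = [torch.cuda.Stream() for _ in self.branches]
        outputs = []
        for branch, x, stream in zip(self.branches, inputs, streams):
            # The side stream must not start before the inputs are ready.
            stream.wait_stream(current)
            with torch.cuda.stream(stream):
                outputs.append(branch(x))
            # Tell the caching allocator that x is also used on this stream.
            x.record_stream(stream)
        for out, stream in zip(outputs, streams):
            # Later work on the current stream waits for every branch.
            current.wait_stream(stream)
            # out was allocated on a side stream but is consumed on current.
            out.record_stream(current)
        return tuple(outputs)


device = "cuda" if torch.cuda.is_available() else "cpu"
block1 = nn.Linear(4, 4).to(device)
block2 = nn.Linear(4, 4).to(device)
branches = Parallel(block1, block2)

x = torch.randn(4, 4, device=device)
x1, x2 = branches(x, x)
out = torch.mm(x1, x2)
```

The `wait_stream` calls are what make this safe: each side stream waits for the producer of its input, and the caller's stream waits for every branch before `torch.mm` consumes the results.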

Anyway hope this helps somebody.

albanD · February 8, 2020, 2:18am

In practice, most of our kernels actually use the whole GPU, so kernels don’t run at the same time.
So we see benefits from using streams only in very niche cases, and general users wouldn’t need them.
We only provide the API for advanced users who know that their workload is in this niche case and want to slightly improve performance.

andravin · February 10, 2020, 6:15am

Hi @albanD,

That’s not exactly how GPUs work. After a kernel issues all of its blocks, there is a “tail” of blocks that partially occupy the GPU. If there are just a few waves of blocks, then the tail can be a significant fraction of the total kernel execution time. Another kernel launched on a second stream could fill the device during the tail of the first kernel, increasing device utilization.
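
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes a simplified model in which every block takes the same time and the device holds a fixed number of blocks at once; the figure of 160 concurrent blocks and the kernel sizes are made up for illustration, and `tail_stats` is a hypothetical helper.

```python
import math


def tail_stats(num_blocks, concurrent_blocks):
    """Return (waves, idle_fraction) under an equal-duration-blocks model."""
    # Number of full or partial waves needed to issue every block.
    waves = math.ceil(num_blocks / concurrent_blocks)
    # Block slots the device offers over the kernel's lifetime.
    slots = waves * concurrent_blocks
    # Share of those slots left empty during the final, partial wave.
    idle_fraction = (slots - num_blocks) / slots
    return waves, idle_fraction


# A small kernel: 200 blocks on a device that runs 160 blocks at a time.
few = tail_stats(num_blocks=200, concurrent_blocks=160)
# A large kernel: 4040 blocks on the same device.
many = tail_stats(num_blocks=4040, concurrent_blocks=160)

print(f"small kernel: {few[0]} waves, {few[1]:.1%} of slots idle")
print(f"large kernel: {many[0]} waves, {many[1]:.1%} of slots idle")
```

In this model both kernels leave the same 120 slots empty in their last wave, but for the two-wave kernel that is over a third of the available slots, while for the 26-wave kernel it is under 3%. That idle share is what a kernel on a second stream could fill.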

The issue for PyTorch would be that kernels with a very short execution time will hit the limit of how fast you can launch kernels from the CPU (let alone from a Python application).

CUDA Graphs were introduced to move kernel launch to the device side, reducing launch overhead for short running kernels.

Apparently there is an effort underway to use CUDA Graphs in PyTorch:

github.com/pytorch/pytorch

Static Graphs using CUDA 10 Graphs API

opened 03:15AM - 30 Dec 18 UTC

closed 12:51AM - 11 Sep 21 UTC

fps7806

feature module: cuda triaged module: cuda graphs

## 🚀 Feature

CUDA 10 released a new feature called CUDA Graphs which allows you to build static graphs that can minimize the overhead of launching multiple kernels. The API comes with functions that allow you to capture a stream (multiple streams are also supported) and transform it into a CUDA graph. Exposing this feature to pytorch can be very beneficial to many applications.

## Motivation

When working with small kernels it is important to minimize any overhead without adding complexity.

## Pitch

I propose something like:

```
g = CUDAGraph()
with g.capture():
    # Static computation that can be combined into 1 graph
    ....
while True:
    g.execute()
```

## Alternatives

Currently Pytorch offers JIT, and C++ APIs that mitigate the overhead. The CUDA Graph approach is theoretically faster and IMHO more friendly to users.

## Additional context

I have a small demo working where I run a small neural network 1000 times and here are the current benchmarks I have:

```
Pytorch c++ Front-end: 152.633ms
Pytorch c++ with cuda graphs capture:136.692ms
Pytorch Python: 354.79ms
Pytorch Python + cuda graphs capture: 134.29ms
```

cc @ngimel

… but it seems to have hit a snag:

github.com/pytorch/pytorch

Some parts of PyTorch test suite don't work properly on non-default stream

opened 02:43PM - 10 Jun 19 UTC

closed 05:50PM - 11 May 20 UTC

ezyang

module: cuda module: tests triaged large

In @fbhuba PR https://github.com/pytorch/pytorch/pull/21474 we switch the stream before running tests. But in this PR, a lot of tests failed and we had to disable the stream switching behavior for large portions of the test suite. There is probably something very important in PyTorch that is not stream safe, we need to figure it out.

Currently known reasons why you might be on wrong stream:

* Calling thrust functions (e.g., `thrust::tabulate`) without specifying a policy that puts things on the right stream
* Backwards runs on default stream

Anyway, as these issues are very interesting to me, I would like to learn more about what the roadmap looks like for this kind of functionality in PyTorch.

[Edit: fixed link to #15623]

albanD · February 10, 2020, 3:42pm

cc @ngimel, who will know better what the plans are for such features.

ngimel · February 10, 2020, 5:30pm

cc @ptrblck, as the CUDA graphs effort is driven by NVIDIA, so he knows more about the roadmap. #21589 did not hit a snag; in fact, we believe that PyTorch is now stream-safe and non-default streams can be used as needed. Some operations incur host-device synchronization and thus would break graph capture, but the majority should be fine. The biggest problem for enabling CUDA graphs, the last time I was involved in this, was the caching allocator: a CUDA graph wants kernel arguments (including data pointers) to remain the same between invocations, and with the caching allocator we cannot guarantee that.
As for the tail effect, streams would indeed help there, but it is still a pretty niche use case: you need a kernel that has just a few waves for the tail to be a noticeable part of execution time, which translates into a fairly narrow range of sizes where it would be observable.
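
To make the pointer-stability constraint concrete, here is a minimal sketch of the usual workaround: allocate the input once, capture the graph against that buffer, and copy new data into it before each replay. It assumes a PyTorch release that provides `torch.cuda.CUDAGraph` and `torch.cuda.graph` (these arrived after this thread was written, in 1.10 or newer as far as I know). `run_graphed` is a hypothetical helper, the layer and tensor sizes are arbitrary, and it returns `None` on a machine without CUDA.

```python
import torch


def run_graphed(n_replays=3):
    """Capture one forward pass in a CUDA graph and replay it (sketch)."""
    if not torch.cuda.is_available():
        return None  # CUDA graphs need a GPU

    model = torch.nn.Linear(16, 16).cuda()
    # Static buffer: its address is baked into the captured graph.
    static_in = torch.zeros(8, 16, device="cuda")

    with torch.no_grad():
        # Warm up on a side stream before capturing.
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side):
            for _ in range(3):
                model(static_in)
        torch.cuda.current_stream().wait_stream(side)

        # Record the kernels launched by one forward pass.
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_out = model(static_in)

    results = []
    for i in range(n_replays):
        # New data goes into the same memory, so the pointers never change.
        static_in.copy_(torch.full((8, 16), float(i), device="cuda"))
        graph.replay()
        # Clone, because the next replay overwrites static_out in place.
        results.append(static_out.clone())
    return results


results = run_graphed()
print("no CUDA device" if results is None else f"{len(results)} replays done")
```

The output tensor is static in the same way as the input: every replay writes into the same memory, which is why the sketch clones it before the next iteration.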

andravin · February 10, 2020, 8:13pm

The “snag” was meant to refer to “Static Graphs using CUDA 10 Graphs API #15623”, I updated my comment with the correct link.

I’m not sure that low achieved occupancy is so uncommon. A small model (e.g., MobileNetV2) on a big GPU will have just a few waves per kernel even for a moderate batch size. Small batch sizes are especially interesting for low-latency inference. It is also possible that choosing a small batch size could improve cache utilization.

I guess I need to learn more about the PyTorch plan for efficient inference on state-of-the-art vision models, efficient training of small models on big GPUs, etc.