How can l run two blocks in parallel?

andravin · February 10, 2020, 6:15am

That’s not exactly how GPUs work. After a kernel issues all of its blocks, there is a “tail” of blocks that partially occupy the GPU. If there are just a few waves of blocks, then the tail can be a significant fraction of the total kernel execution time. Another kernel launched on a second stream could fill the device during the tail of the first kernel, increasing device utilization.

The issue for PyTorch would be that kernels that have a very short execution time will hit the limit of how fast you can launch kernels from the CPU (let alone from a python application).

CUDA Graphs were introduced to move kernel launch to the device side, reducing launch overhead for short running kernels.

Apparently there is an effort underway to use CUDA Graphs in PyTorch,

github.com/pytorch/pytorch

Static Graphs using CUDA 10 Graphs API

opened 03:15AM - 30 Dec 18 UTC

closed 12:51AM - 11 Sep 21 UTC

fps7806

feature module: cuda triaged module: cuda graphs

## 🚀 Feature CUDA 10 released a new feature called CUDA Graphs which allows you… to build static graphs that can minimizes the overhead of launching multiple kernels. The API comes with functions that allow you to capture a stream (multiple streams are also supported) and transform it into a CUDA graph. Exposing this feature to pytorch can be very beneficial to many applications. ## Motivation When working with small kernels it is important to minimize any overhead without adding complexity. ## Pitch I propose something like: ``` g = CUDAGraph() with g.capture(): # Static computation that can be combined into 1 graph .... while True: g.execute() ``` ## Alternatives Currently Pytorch offers JIT, and C++ APIs that mitigate the overhead. The CUDA Graph approach is theoretically faster and IMHO more friendly to users. ## Additional context I have a small demo working where I run a small neural network 1000 times and here are the current benchmarks I have: ``` Pytorch c++ Front-end: 152.633ms Pytorch c++ with cuda graphs capture:136.692ms Pytorch Python: 354.79ms Pytorch Python + cuda graphs capture: 134.29ms ``` cc @ngimel

… but it seems to have hit a snag:

github.com/pytorch/pytorch

Some parts of PyTorch test suite don't work properly on non-default stream

opened 02:43PM - 10 Jun 19 UTC

closed 05:50PM - 11 May 20 UTC

ezyang

module: cuda module: tests triaged large

In @fbhuba PR https://github.com/pytorch/pytorch/pull/21474 we switch the stream… before running tests. But in this PR, a lot of tests failed and we had to disable the stream switching behavior for large portions of the test suite. There is probably something very important in PyTorch that is not stream safe, we need to figure it out. Currently known reasons why you might be on wrong stream: * Calling thrust functions (e.g., `thrust::tabulate`) without specifying a policy that puts things on the right stream * Backwards runs on default stream

Anyway, as these issues are very interesting to me, I would like to learn more about what the roadmap looks like for this kind of functionality in PyTorch.

[Edit: fixed link to #15623]