As a hardware accelerator vendor, we'd like to integrate our compiler/runtime into PyTorch as a dynamo backend. Our primitive API is fundamentally asynchronous (`async def` in Python), which is confusing to me since the torch codebase is essentially all synchronous.
I've read about how asynchronous execution is achieved on GPUs, but it looks deeply integrated with CUDA and PyTorch internals, so it's unclear to me how a non-GPU backend like ours should integrate.
Can anybody explain how to use async functions in a compiler backend?
Part of the way that dynamo + torch.compile execution works is that users can compile “partial graphs”, but can also fall back to eager mode between graphs.
This leaves you with a choice as a hardware backend:
(1) Does your hardware backend support both a compilation mode and “standard” pytorch eager mode?
(2) Or does it only support compilation, e.g. falling back to an existing eager mode backend, like our CPU or GPU kernels?
If you go with (2), then your “async compiler backend” is limited to the existing asynchronous execution of the eager mode backend that you use. E.g. if the user executes a compiled subgraph, graph breaks, and runs the next op on CPU, then the output of the compiled graph can’t really be asynchronous - because the output of that compiled graph needs to be completed and exist in CPU memory, so the next CPU operator can run in eager mode.
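One way to reconcile an async runtime with this synchronous boundary is to launch work on an event loop as soon as the compiled subgraph is invoked, and block only when the result must be handed back to eager mode. Here is a minimal stdlib-only sketch of that idea; `npu_execute`, `AsyncRuntime`, and `compiled_subgraph` are hypothetical names standing in for your runtime, not PyTorch APIs:

```python
import asyncio
import threading

# Hypothetical stand-in for an async NPU runtime call.
async def npu_execute(graph_name, inputs):
    await asyncio.sleep(0.01)          # pretend device latency
    return [x * 2 for x in inputs]     # pretend computed outputs

class AsyncRuntime:
    """Runs an event loop on a background thread so compiled
    subgraphs can be launched without blocking the caller."""
    def __init__(self):
        self.loop = asyncio.new_event_loop()
        threading.Thread(target=self.loop.run_forever, daemon=True).start()

    def launch(self, coro):
        # Returns a concurrent.futures.Future immediately.
        return asyncio.run_coroutine_threadsafe(coro, self.loop)

runtime = AsyncRuntime()

def compiled_subgraph(inputs):
    # A dynamo backend must hand back a *synchronous* callable, so we
    # launch asynchronously but synchronize before returning: at a
    # graph break, the next eager-mode CPU op needs a materialized result.
    fut = runtime.launch(npu_execute("subgraph_0", inputs))
    # ... other host-side work could overlap with device execution here ...
    return fut.result()   # synchronization point at the graph break

out = compiled_subgraph([1, 2, 3])
print(out)  # [2, 4, 6]
```

The key point is that the asynchronicity lives entirely *inside* the compiled region; by the time control returns to eager mode, the outputs are fully materialized.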
If you also support eager mode execution as a custom backend, then you could probably implement async execution in eager mode, similar to how it’s implemented for the cuda backend. You can check out this example (with a test) for how to implement an out-of-tree eager mode backend. We have support for implementing eager mode backends out of tree - feel free to file an issue if you think there’s a functionality gap, since we’re looking to improve this!
We are in (2), as we don't have a 'standard' PyTorch eager mode implementation.
To check whether I understood you correctly, suppose a graph has 4 ops like below:
op1(cpu) → op2(npu) → op3(npu) → op4(cpu).
You are saying op3 → op4 cannot be truly asynchronous, since the input tensor of op4 (which runs on CPU) has to wait for op3's output (which runs on NPU), right? That sounds reasonable to me, as tensors from different compute units are not naturally integrated.
But what I am wondering about here is asynchronous operation between op2 and op3. Within one 'op' (actually not a single op, but rather a group of accelerable ops on our hardware), we have both computation and I/O, and we'd like to overlap them.
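That kind of overlap between op2 and op3 should be fine, because no eager-mode consumer sits between them; the synchronization boundary only appears at the graph break before op4. A stdlib-only sketch of the overlap itself, with `dma_transfer` and `npu_compute` as hypothetical stand-ins for your runtime's I/O and compute primitives:

```python
import asyncio
import time

# Hypothetical stand-ins for the runtime's async primitives.
async def dma_transfer(name, delay):
    await asyncio.sleep(delay)   # pretend I/O (host<->device DMA)
    return name

async def npu_compute(name, delay):
    await asyncio.sleep(delay)   # pretend device compute
    return name

async def fused_region():
    # Inside the compiled subgraph (op2 -> op3) there is no eager-mode
    # consumer in between, so op3's input transfer can overlap with
    # op2's compute instead of running back-to-back.
    start = time.perf_counter()
    results = await asyncio.gather(
        npu_compute("op2", 0.05),
        dma_transfer("op3_inputs", 0.05),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(fused_region())
print(results)   # ['op2', 'op3_inputs']
# elapsed is ~0.05s (overlapped), not ~0.10s (serialized)
```

Since both coroutines are in flight concurrently, the region takes roughly the duration of the longer of the two, rather than their sum.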
As for option (1), your explanation is very helpful for understanding the basic interface torch requires. But from my understanding, the torch 2.0 compilation path is claimed to be a more lightweight integration point for new hardware accelerators, so it looks a bit scary if we still have to implement the 'standard' eager path as well just to reach the performance we already provide through our standalone API.
Is it worthwhile to implement an eager mode path for our hardware just to support async operations? I'm asking for your opinion because I may be worrying prematurely, as I don't know the torch API well.