As far as I understand, Inductor-generated code does not yet use CUDA graphs to further speed up multi-kernel code paths.
Is there a practical example somewhere of how to use CUDA graphs with Inductor/Triton-generated code paths anyway? (I read somewhere that, e.g., the same input/output tensors must be reused, so it would probably have to handle that, along with a fixed set of input shapes?)
It should work, yeah. You might just need to call torch._dynamo.reset() in between different calls.
Btw, does torch.compile support different compilation options for different subgraphs? E.g., will it work as expected if I first compile a sub-module with some option (e.g. reduce-overhead) and then compile the whole module without such options?
Also, does it impose constraints such as having to reuse the same input/output tensors, and how does that interact with multiple possible batch shapes?
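
For reference, this is roughly what those two constraints look like when capturing CUDA graphs by hand with torch.cuda.CUDAGraph. It is only a sketch of the manual route and says nothing definitive about what torch.compile does internally. The class GraphedByShape and the toy function f are hypothetical, it targets inference-only code without autograd, and it keys graphs on shape alone (not dtype or device). Whether the same wrapper can be applied unchanged to Inductor-compiled functions is an assumption I have not verified.

```python
import torch


class GraphedByShape:
    """Keep one captured CUDA graph per input shape (hypothetical helper)."""

    def __init__(self, fn, warmup=3):
        self.fn = fn
        self.warmup = warmup
        self.entries = {}  # shape -> (graph, static_in, static_out)

    def _capture(self, example):
        # The graph records fixed memory addresses, so it gets its own buffers.
        static_in = example.clone()
        # Warm up on a side stream before capturing.
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side):
            for _ in range(self.warmup):
                self.fn(static_in)
        torch.cuda.current_stream().wait_stream(side)
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_out = self.fn(static_in)
        return graph, static_in, static_out

    def __call__(self, x):
        if not x.is_cuda:
            return self.fn(x)  # no CUDA graphs on CPU, run eagerly
        key = tuple(x.shape)
        if key not in self.entries:
            self.entries[key] = self._capture(x)
        graph, static_in, static_out = self.entries[key]
        static_in.copy_(x)         # same input tensor on every replay
        graph.replay()             # relaunch the recorded kernels
        return static_out.clone()  # static_out is overwritten by the next replay


def f(x):
    # Toy elementwise workload (hypothetical).
    return torch.relu(x) * 2 + 1


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    graphed = GraphedByShape(f)
    for batch in (2, 4, 2):  # the third call reuses the graph captured for batch 2
        x = torch.randn(batch, 8, device=device)
        print(batch, torch.allclose(graphed(x), f(x)))
```

Each new batch shape costs one extra capture and one extra set of static buffers, which is why a fixed, small collection of shapes is usually assumed.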