Use of CUDA Graphs for the forward and backward passes of Dynamo/Inductor-compiled modules

As far as I understand, Inductor-generated code does not yet make use of CUDA graphs for further speed-ups of multi-kernel code paths.

Is there a practical example somewhere of how to still use CUDA graphs with Inductor/Triton-generated code paths? (I read that, e.g., the same input/output tensors must be reused, so presumably one has to deal with that and with a fixed collection of input shapes?)
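For concreteness, here is the manual capture pattern I have in mind, following the documented torch.cuda.CUDAGraph workflow (the model and shapes are just placeholders; whether it is actually safe to wrap this around Inductor-generated code is exactly what I'm asking):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 64)
).cuda()
compiled = torch.compile(model)

# Static buffer: capture and replay must reuse the SAME input tensor
static_input = torch.randn(8, 64, device="cuda")

# Warm up on a side stream so compilation/autotuning happen before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        compiled(static_input)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = compiled(static_input)

# Run on new data: copy into the static input tensor, then replay the graph
static_input.copy_(torch.randn(8, 64, device="cuda"))
g.replay()
# static_output now holds the result for the copied-in data
```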

It should work, yeah; you might just need to call torch._dynamo.reset() in between different calls.
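Something like this sketch, assuming you want Inductor's built-in CUDA graphs path via mode="reduce-overhead" (the module here is just a placeholder):

```python
import torch
import torch._dynamo

model = torch.nn.Linear(64, 64).cuda()  # placeholder module
x = torch.randn(8, 64, device="cuda")

# mode="reduce-overhead" enables Inductor's own CUDA graphs wrapper
compiled = torch.compile(model, mode="reduce-overhead")
compiled(x)

# Reset Dynamo's caches before recompiling the same code with other options
torch._dynamo.reset()
compiled_plain = torch.compile(model)
compiled_plain(x)
```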


Btw, does torch.compile support different compilation options for different subgraphs? E.g., will it work as expected if I first compile a sub-module with some option (e.g. with reduce-overhead), and then compile the whole module (without such options)?
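Concretely, the nesting I mean would look like this (Inner/Outer are hypothetical placeholder modules):

```python
import torch

class Inner(torch.nn.Module):  # hypothetical sub-module
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.lin(x))

class Outer(torch.nn.Module):  # hypothetical parent module
    def __init__(self):
        super().__init__()
        self.inner = Inner()
        self.head = torch.nn.Linear(64, 10)

    def forward(self, x):
        return self.head(self.inner(x))

model = Outer().cuda()
# First compile just the sub-module with reduce-overhead...
model.inner = torch.compile(model.inner, mode="reduce-overhead")
# ...then compile the whole module without extra options.
compiled = torch.compile(model)
compiled(torch.randn(8, 64, device="cuda"))
```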

Also, a question on whether this imposes constraints around reusing the same input/output tensors and around supporting multiple possible batch shapes?
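E.g., for multiple batch shapes, is a one-graph-per-shape scheme like the following sketch the expected workaround (all names here are placeholders)?

```python
import torch

model = torch.nn.Linear(64, 64).cuda()  # placeholder module
graphs, static_in, static_out = {}, {}, {}

def capture_for_batch(bs):
    # One static input/output pair and one captured graph per batch size
    static_in[bs] = torch.randn(bs, 64, device="cuda")
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):  # warm up on a side stream before capture
            model(static_in[bs])
    torch.cuda.current_stream().wait_stream(s)
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out[bs] = model(static_in[bs])
    graphs[bs] = g

def run(x):
    bs = x.shape[0]
    if bs not in graphs:
        capture_for_batch(bs)
    static_in[bs].copy_(x)   # reuse the SAME static input tensor
    graphs[bs].replay()
    return static_out[bs]    # same static output tensor each replay

for bs in (4, 8):
    y = run(torch.randn(bs, 64, device="cuda"))
```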