Torch.compile CudaGraph creation & downstream systems

Hi all, looking at the PyTorch code, my understanding is that the torch.compile codegen process first analyzes the PyTorch/Python function code to see whether CUDA graphs will work. I am assuming that CUDA graphs do not support many things, and that those constraints must be checked upfront (e.g., a heterogeneous or non-NVIDIA hardware platform where CUDA graphs may not be supported). If the code passes those checks, codegen creates a runtime wrapper around the generated Triton code that applies CUDA graphs via the “cudagraphify” function [around the call(args): function in the generated code].
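
For context, here is a minimal sketch of what I mean (my own toy example, not code taken from Inductor): enabling Inductor's CUDA graph wrapping through torch.compile's "reduce-overhead" mode, which is where the cudagraphify-style runtime wrapper would come into play if the upfront checks pass.

```python
# A minimal sketch, assuming a CUDA device is available; not vLLM/SGLang code.
import torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(128, 128)

    def forward(self, x):
        return torch.relu(self.linear(x))

if torch.cuda.is_available():
    model = TinyModel().cuda()
    # mode="reduce-overhead" asks Inductor to apply CUDA graphs where its
    # upfront checks (CUDA device, supported ops, static shapes, etc.) allow;
    # otherwise it falls back to running the generated kernels without graphs.
    compiled = torch.compile(model, mode="reduce-overhead")
    x = torch.randn(32, 128, device="cuda")
    for _ in range(3):  # a few warm-up iterations before graph capture kicks in
        y = compiled(x)
    print(y.shape)
```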

With that as a quick background on my understanding, my question is: how does all of this play out in downstream inference engines (such as vLLM, SGLang, etc.), given that they do their own CUDA graph capture/replay during the server boot-up process? Do these systems (re)evaluate the applicability of CUDA graphs again at capture time? My rough mental model of what those engines do at boot is the explicit capture/replay pattern sketched below. In any case, I am just trying to understand the overall PyTorch codegen/CUDA graph process for my own education and to clear the brain fog on this topic. Looking forward to responses from the experts on this forum. Appreciate it. Thanks.
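
Here is the capture/replay pattern I have in mind (names, shapes, and the single-bucket setup are illustrative only, not taken from vLLM or SGLang): warm up, capture the model's kernel launches into a graph against static buffers, then copy fresh data into those buffers and replay.

```python
# A minimal sketch, assuming a CUDA device; illustrative only.
import torch

if torch.cuda.is_available():
    model = torch.nn.Linear(128, 128).cuda().eval()

    # Static input buffer that the captured graph will always read from.
    static_input = torch.zeros(32, 128, device="cuda")

    # Warm up on a side stream so capture sees initialized kernels/allocator state.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s), torch.no_grad():
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture: kernels launched inside this context are recorded, not run eagerly.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph), torch.no_grad():
        static_output = model(static_input)

    # Replay: copy new data into the static buffer, then launch the whole
    # recorded kernel sequence with a single call.
    new_batch = torch.randn(32, 128, device="cuda")
    static_input.copy_(new_batch)
    graph.replay()
    print(static_output[0, :4])
```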

Regards,
Deepak Vij

vLLM handles its CUDA graphing separately from torch.compile right now, so it does not use torch.compile’s built-in cudagraphs.

Makes sense. Thanks.