Investigating torch custom ops using torch.compile

Hi community,

I am working with the vLLM engine to deep-dive into the LLM inference process. I am trying to capture the inputs to the triton kernels in vLLM and then re-run those kernels standalone to understand each kernel's execution time. To implement this, I have tried both approaches mentioned here in the pytorch docs to capture the input tensors for each node in the dynamo/AOT-autograd graph. However, I noticed that triton kernels invoked from functions registered as torch custom ops (example) are not visible in the compilation graph at either the dynamo or the AOT-autograd level. I want to understand if there is any way I can trace the underlying triton kernels called by functions registered as torch custom ops. Thanks!
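For context, this is a minimal sketch of the node-capture approach I mean: a custom torch.compile backend that records the call_function targets dynamo traced. The function here is a toy stand-in (not a vLLM model); in the real setup, a custom op would show up as a single opaque node in this graph, with its internal triton kernel launches invisible.

```python
import torch

captured = []

def capture_backend(gm: torch.fx.GraphModule, example_inputs):
    # Record every call_function node dynamo traced. A torch custom op
    # appears here as ONE opaque node; the triton kernels it launches
    # internally are not part of the graph.
    captured.extend(
        node.target for node in gm.graph.nodes if node.op == "call_function"
    )
    return gm.forward  # run the captured graph eagerly, no codegen needed

@torch.compile(backend=capture_backend)
def f(x):
    return torch.relu(x) + 1

f(torch.randn(4))
print(captured)  # targets such as torch.relu and operator.add
```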

I’m unsure if I understand your use case correctly, but did you try to profile the workload (e.g. via nsys) to see which kernels are called? If so, are you missing some kernels there?

Hi @ptrblck, I am sure I would be able to see all the underlying kernels in the nsys profiler. However, my goal is to trace them programmatically at the torch.compile level (ideally without going down to tracing at the CUDA level) and to analyze/save the kernel inputs. The issue I primarily face is with kernels registered as torch custom ops: those kernels are opaque to my tracing mechanism at the torch.compile level.
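One programmatic option that stays above the CUDA level is a TorchDispatchMode: custom ops registered through torch.library still go through the dispatcher, so their calls and input tensors can be logged there, even though the triton kernels they launch internally remain invisible. A minimal sketch (the relu/add workload is just illustrative):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class KernelInputLogger(TorchDispatchMode):
    """Log every op that reaches the dispatcher along with its inputs.

    This sees custom ops as single opaque calls; it does not expose the
    triton kernels those ops launch internally.
    """
    def __init__(self):
        super().__init__()
        self.calls = []

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Save the op name and its arguments for later standalone replay.
        self.calls.append((str(func), args, kwargs))
        return func(*args, **kwargs)

x = torch.randn(8)
with KernelInputLogger() as logger:
    y = torch.relu(x) + 1

for name, args, kwargs in logger.calls:
    print(name)  # e.g. aten.relu.default, aten.add.Tensor
```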