Hi community,
I am working with the vLLM engine to deep-dive into the LLM inference process. I am trying to capture the inputs to the Triton kernels in vLLM and then re-run those kernels standalone to measure each kernel's execution time.

To implement this, I have tried both approaches mentioned here in the PyTorch docs to capture the input tensors for each node in the Dynamo/AOT-autograd graph. However, I noticed that Triton kernels invoked from functions registered as torch custom ops (example) are not visible in the compilation graph at either the Dynamo or the AOT-autograd level.

I want to understand if there is any way to trace the underlying Triton kernels called by functions registered as torch custom ops. Thanks!