Aten Native CUDA Op Tracing

Jerome_Ku · October 25, 2023, 5:06am

Hi!

I’m profiling PyTorch code using NVIDIA nsys and noticed calls to kernels such as at::native::<unnamed>::CatArrayBatchedCopy and at::native::elementwise_kernel. I can see the implementations of these kernels here and here.

However, I can’t seem to find where these kernels are actually called when searching the repo – searching for these identifiers only results in the implementation files.

I’d like to trace the chain of calls leading to these kernel launches from higher-level operators (e.g., Conv2d), and more generally, understand the internal plumbing of PyTorch in greater detail.

I’d greatly appreciate it if anybody could explain how ATen native operators are connected to higher-level functions and, eventually, to Python!

bdhirsh · October 28, 2023, 8:14am

at::native kernels are dynamically registered to the pytorch dispatcher through code-generated files (which has been a very useful abstraction, but can make grepping the codebase for them difficult).

Some helpful links:

intro to the dispatcher: https://blog.ezyang.com/2020/09/lets-talk-about-the-pytorch-dispatcher/

deeper dive into the full set of function calls from torch* to a native kernel: "PyTorch dispatcher walkthrough" (pytorch/pytorch wiki on GitHub)

Jerome_Ku · October 28, 2023, 5:10pm

Thanks! Those are great resources – been a big fan of EZYang’s in-depth podcasts and posts on pytorch internals. The wiki is also a goldmine.

FWIW, running nsys with sampling enabled (--sample cpu) lets you see full function traces. Here is a sample screenshot:

saurabh-singh-rajput · May 24, 2024, 9:57pm

I’m sorry to bother you, but is there a way to run these kernels standalone, in isolation, in order to benchmark them?