Leveraging NVTX Structure

I use NVTX and Nsight Systems when training with PyTorch, and the level of detail in the traces is outstanding; those system-level details have been critical to my research. Kudos to the team on such great work!

My question is: how can I reuse some of this infrastructure in a standalone Python/CUDA project? That is, how do I get up and running with at least a minimal subset of torch’s NVTX scaffolding?

For example,

  1. I would love to have complete function tracing, where any Python/C++ function is traced all the way down, including all sub-function calls.
  2. Also, the `seq` and `op_id` fields on ranges are great to have.

Representative examples are shown below. I would love to have that level of detail :slight_smile:


All the way down


Note the bytes field from the NCCL call

CUDA kernels will be recorded by default, of course, but you can add NVTX markers around calls to capture more detail. Your first example shows how nvFuser adds NVTX markers, and you can check the code to see how the backend handles them.

Thanks @ptrblck. I guess what I was looking for was something of an NVTX 101 drawn from existing code, which would let me reimplement even a subset of the host-side traces that PyTorch provides; the nvFuser screenshot I included was from a PyTorch training run. Still, I agree that digging through the code of both PyTorch and NCCL (whose NVTX instrumentation is also very good) would be the best next step!

You can also just search the linked nvFuser repository for nvtx and see how the markers were implemented.