Leveraging NVTX Structure

I use NVTX and Nsight Systems when training with PyTorch, and the level of detail in the traces is outstanding; those system-level details have been critical to my research. Kudos to the team on such great work!

My question is: how can I reuse some of this infrastructure in a standalone Python/CUDA project? That is, how do I get up and running with at least a minimal subset of torch’s NVTX scaffolding?

For example,

  1. I would love to have complete function tracing, where any Python/C++ function is traced all the way down, including all sub-function calls.
  2. Also, the `seq` and `op_id` fields on ranges are great to have.

Representative examples are shown below. I would love to have that level of detail :slight_smile:


All the way down


Note the bytes field from the NCCL call

CUDA kernels will be recorded by default, of course, but you can add NVTX markers around calls to capture more detail. Your first example shows how nvFuser adds NVTX markers, and you can check the code to see how the backend handles them.

Thanks @ptrblck. I guess what I was looking for was something of an NVTX 101 drawn from existing code, which would let me reimplement even a subset of the host-side traces that PyTorch provides; the nvFuser screenshot I included was from a PyTorch training run. Still, I agree that digging through the code of both PyTorch and NCCL (whose NVTX instrumentation is also very good) would be the best next step!

You can also just search the linked nvFuser repository for nvtx and see how the markers were implemented.