Aggregate profiler output by module class?

I’m trying to use the profiler, and I’m having trouble interpreting the output. Is it possible to aggregate the profiling data by the Python class(es) which invoked the operation? I’d love to be able to see how much time/memory is going into MultiHeadAttention instead of just learning that “matmul” is the biggest hog.

If the profiling captures the call stack when operations are invoked I’d guess this wouldn’t be too hard. If it’s not capturing the stack, I’d certainly be willing to pay whatever performance penalty is needed while running the profile.

If you are profiling your script on the GPU, you could use NVTX ranges and tag the multi-head attention operation with one.
I don’t know whether the built-in profiler provides options to print layer-specific output, but you could also take a look at PyProf.


Thanks for the tip. I’m trying to understand it, as the docs are pretty thin. Would I do something like this:

    torch.cuda.nvtx.range_push("multihead_attention")
    out = self.attention(x)
    torch.cuda.nvtx.range_pop("multihead_attention")

And then hopefully this shows up in the profiler output?

range_pop() doesn’t take the range name (it simply closes the most recently pushed range), so drop the argument there; besides that it should work.
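Since every range_push needs a matching range_pop, one option is to wrap the pair in a small context manager so the pop still runs if the wrapped code raises. Below is a minimal sketch, not an official API: `nvtx_range` is a made-up helper name, and the `push`/`pop` parameters are my own addition so the pairing logic can be tried on a machine without a CUDA build. By default it falls back to `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop`.

```python
from contextlib import contextmanager


@contextmanager
def nvtx_range(name, push=None, pop=None):
    """Open an NVTX range on entry and close it on exit, even on exceptions.

    `nvtx_range`, `push` and `pop` are hypothetical names for this sketch.
    """
    if push is None or pop is None:
        # Imported lazily so the helper itself does not require a CUDA build.
        import torch

        push = torch.cuda.nvtx.range_push
        pop = torch.cuda.nvtx.range_pop

    push(name)
    try:
        yield
    finally:
        # No name is passed here: the pop closes the innermost open range.
        pop()
```

Inside a module's forward you would then write `with nvtx_range("multihead_attention"): out = self.attention(x)`. Nesting works as expected, since each exit pops the innermost range.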

My workflow would be:

    # Setup
    ...

    # enable the profiling, such that nvprof or nsys doesn't capture operations before this point
    torch.cuda.cudart().cudaProfilerStart()

    torch.cuda.nvtx.range_push("forward")
    output = model(input)
    torch.cuda.nvtx.range_pop()

    # no NVTX range around these calls (they are still captured, just untagged)
    optimizer.zero_grad()
    loss = criterion(output, target)

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer.step()")
    optimizer.step()
    torch.cuda.nvtx.range_pop()

    torch.cuda.cudart().cudaProfilerStop()

You can of course use nested ranges, if needed.

Then run the script via:

    # with nsys
    nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu -o my_report --capture-range=cudaProfilerApi --stop-on-range-end=true --cudabacktrace=true --cudabacktrace-threshold=10000 --osrt-threshold=10000 -x true python script.py

    # or with nvprof
    # single-process
    nvprof --profile-from-start off -fo %p.nvprof python script.py

    # multi-process
    nvprof --profile-child-processes --profile-from-start off -fo %p.nvprof python -m torch.distributed.launch --nproc_per_node=2 script.py

Note that my nsys command also collects stack traces etc., so you might want to slim it down a bit.
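Coming back to the original question of grouping by module class: one approach that might work is to register forward hooks that open an NVTX range around every submodule, named after its class, so the timeline shows which module launched which kernels. This is only a sketch under assumptions: `add_nvtx_hooks` is a hypothetical helper, the `ClassName:module_name` label format is my own choice, and the `push`/`pop` parameters exist only to allow a dry run without CUDA. It covers the forward pass only, and a range stays open if a forward call raises.

```python
def add_nvtx_hooks(model, push=None, pop=None):
    """Wrap each submodule's forward call in an NVTX range named after its class.

    Returns the hook handles so the hooks can be removed again.
    `add_nvtx_hooks`, `push` and `pop` are hypothetical names for this sketch.
    """
    if push is None or pop is None:
        # Imported lazily so the hook logic can be tried without a CUDA build.
        import torch

        push = torch.cuda.nvtx.range_push
        pop = torch.cuda.nvtx.range_pop

    handles = []
    for name, module in model.named_modules():
        # The top-level module has an empty name, so label it "root".
        label = f"{type(module).__name__}:{name or 'root'}"

        def pre_hook(mod, inputs, label=label):
            # Runs right before mod.forward: open the range.
            push(label)

        def post_hook(mod, inputs, output):
            # Runs right after mod.forward: close the innermost range.
            pop()

        handles.append(module.register_forward_pre_hook(pre_hook))
        handles.append(module.register_forward_hook(post_hook))
    return handles
```

You would call `handles = add_nvtx_hooks(model)` before the profiled region and `for h in handles: h.remove()` afterwards to detach the hooks again.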
