Timeline of NCCL activity

I am following the distributed data parallel (DDP) tutorial on a server with two NVLink-connected GPUs. I want to see which PyTorch code leads to which NCCL collective calls (Broadcast, Gather, etc.). Is there a profiler option that lets me do this? So far, I run nsys profile --gpu-metrics-device=all python3 ddp.py to profile my code, and then I analyze the generated report with nsys stats <report>. However, this does not give me a timeline of any kind.

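nsys stats prints text reports rather than a graphical timeline. Open the generated report file in the Nsight Systems GUI (nsys-ui) and you will see the timeline. The NCCL collectives appear there as CUDA kernels whose names begin with nccl.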
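
To tie the NCCL kernels on the timeline back to specific PyTorch code, one option is to wrap parts of the training loop in NVTX ranges. The sketch below is a minimal example under stated assumptions, not the tutorial's code: nvtx_range and train_step are hypothetical names, and the helper falls back to a no-op when PyTorch or CUDA is not available.

```python
import contextlib

try:
    import torch
    _USE_NVTX = torch.cuda.is_available()
except ImportError:  # allow the helper to be imported without PyTorch
    torch = None
    _USE_NVTX = False


@contextlib.contextmanager
def nvtx_range(name):
    """Mark a named NVTX range (hypothetical helper; no-op without CUDA)."""
    if _USE_NVTX:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _USE_NVTX:
            torch.cuda.nvtx.range_pop()


def train_step(model, inputs, targets, loss_fn, optimizer):
    """One training step with each phase labeled (hypothetical function)."""
    optimizer.zero_grad()
    with nvtx_range("forward"):
        loss = loss_fn(model(inputs), targets)
    with nvtx_range("backward"):
        # With a DDP-wrapped model, the gradient all-reduce is launched
        # during this call, so its NCCL kernels should fall in this range.
        loss.backward()
    with nvtx_range("optimizer_step"):
        optimizer.step()
    return loss
```

For the ranges to be recorded, NVTX and CUDA tracing need to be enabled when profiling, for example nsys profile --trace=cuda,nvtx python3 ddp.py. The named ranges should then show up as their own rows in nsys-ui, which makes it easier to see which phase of the step each NCCL kernel overlaps with.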