Profiling Distributed Data Parallel Applications


I am trying to profile an application using DistributedDataParallel Module.

Is there a specific set of guidelines for measuring the communication overheads (allreduce time, broadcast time, etc.)?

I used with torch.autograd.profiler.profile(use_cuda=True), but I didn't get information about these calls. It may only track basic operator calls, not collectives like allreduce or broadcast happening in the ProcessGroup (NCCL) layer.

Please correct me if I am wrong.

Thank You,

Hey @Vibhatha_Abeykoon, DDP does not work with the autograd profiler yet, but this is on our roadmap. Until then, would nvprof be able to serve your use case?


@mrshenli Sorry for the late response. Yes, it could also be useful. I will check.
Thank You.

Profiling with DDP is enabled now, i.e., the collectives will be profiled. You can simply run the DDP model under the profiler as normal:

with torch.profiler.profile():
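To make this concrete, here is a minimal sketch of profiling a DDP backward pass, where the gradient allreduce happens. It uses a single-process "gloo" process group purely for illustration; in real training each rank would run this with its own rank, world size, and (typically) the "nccl" backend, and the collective events would then show up in the profiler output alongside the operator events.

```python
# Minimal sketch: profiling DDP collectives with torch.profiler.
# Assumption: single-process "gloo" group for illustration only; in a real
# job each rank runs this with its own rank/world_size (and usually NCCL).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def profile_ddp_step():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = DDP(torch.nn.Linear(8, 4))
    inp = torch.randn(16, 8)

    with torch.profiler.profile() as prof:
        loss = model(inp).sum()
        loss.backward()  # triggers the gradient allreduce inside DDP

    dist.destroy_process_group()
    return prof


if __name__ == "__main__":
    prof = profile_ddp_step()
    # The aggregated table includes the collective ops recorded by DDP.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

With use_cuda / CUDA activities enabled and the NCCL backend, the same pattern also surfaces the GPU-side timing of the collectives.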