Profiling Distributed Data Parallel Applications


I am trying to profile an application that uses the DistributedDataParallel module.

Is there a specific set of guidelines for measuring the communication overheads (allreduce time, broadcast time, etc.)?

I used with torch.autograd.profiler.profile(use_cuda=True), but I didn't get information about these calls. It may only track basic operator calls, not collectives like allreduce or broadcast that happen in the ProcessGroup (NCCL) layer.
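For context, a minimal sketch of that attempt (CPU-only here so it runs anywhere; on a GPU machine you would pass use_cuda=True as described above):

```python
import torch
import torch.autograd.profiler as profiler

# Toy model and input for illustration only
model = torch.nn.Linear(10, 10)
x = torch.randn(4, 10)

with profiler.profile() as prof:
    model(x)

# Prints per-operator timings; ProcessGroup collectives do not appear here
print(prof.key_averages().table(sort_by="cpu_time_total"))
```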

Please correct me if I am wrong.

Thank You,

Hey @Vibhatha_Abeykoon, DDP does not work with the autograd profiler yet, but this is on our roadmap. In the meantime, would nvprof be able to serve your use case?


@mrshenli Sorry for the late response. Yes, it could also be useful. I will check.
Thank You.

Profiling with DDP is enabled now, i.e. the collectives will be profiled. You can simply run the DDP model under the profiler as normal:

with torch.profiler.profile():
    ...  # run the DDP training step as usual
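As a concrete, self-contained sketch, the following runs a single-process DDP model under the profiler using the gloo backend on CPU (the single-rank setup, address, and port here are assumptions for illustration; a real job would launch multiple ranks, e.g. via torchrun, and with NCCL the collectives appear as CUDA-side events):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo setup purely for demonstration
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(10, 10))
x = torch.randn(4, 10)

with torch.profiler.profile() as prof:
    loss = model(x).sum()
    loss.backward()  # backward triggers DDP's gradient allreduce

# Collective calls are now included alongside the operator timings
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
dist.destroy_process_group()
```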

@rvarm1 – do you have a more thorough example of using the torch profiler with DDP and torch.distributed? I'm trying to apply it to my training script, but I'm running into an issue where dist.init_process_group hangs indefinitely when the profiler is used. The script runs normally once every instance of the profiler is removed.