I am trying to profile an application that uses the DistributedDataParallel module.
Is there a specific set of guidelines for measuring the communication overheads (allreduce time, broadcast time, etc.)?
I used with torch.autograd.profiler.profile(use_cuda=True), but I didn't get any information about these calls. It may be that the profiler only tracks basic ops, not collectives like allreduce or broadcast that happen in the ProcessGroup (NCCL) layer.
Please correct me if I am wrong.
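For reference, this is roughly what I tried (a minimal sketch; the tiny model is a placeholder and I assume a single-node torchrun / torch.distributed.launch launch with the NCCL backend):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder setup: assumes the launcher set MASTER_ADDR, MASTER_PORT,
# RANK, WORLD_SIZE, and LOCAL_RANK in the environment.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = DDP(nn.Linear(10, 10).cuda(), device_ids=[local_rank])
inputs = torch.randn(32, 10).cuda()

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    model(inputs).sum().backward()

# I only see aten::* ops here, no allreduce/broadcast entries.
print(prof.key_averages().table(sort_by="cuda_time_total"))
```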
Hey @Vibhatha_Abeykoon, DDP does not work with the autograd profiler yet, but this is on our roadmap. In the meantime, would nvprof be able to serve your use case?
@mrshenli Sorry for the late response. Yes, it could also be useful. I will check.
Profiling with DDP is enabled now, i.e. the collectives will be profiled. You can simply run the DDP model under the profiler as normal:
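For example, something along these lines (a minimal single-node sketch with a placeholder model; the exact names of the collective entries in the output depend on the backend and PyTorch version):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder setup: assumes a torchrun-style launch with the NCCL backend.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

ddp_model = DDP(nn.Linear(10, 10).cuda(), device_ids=[local_rank])
inputs = torch.randn(32, 10).cuda()

# Run a few iterations of the DDP model under the profiler as usual.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for _ in range(5):
        ddp_model(inputs).sum().backward()

# The DDP collectives (e.g. the allreduce of gradients) should now appear
# in the profiler table alongside the regular aten::* ops.
print(prof.key_averages().table(sort_by="cuda_time_total"))

dist.destroy_process_group()
```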
@rvarm1 – do you have a more thorough example of using the torch profiler with DDP and
torch.distributed? I’m trying to apply it to my training script but am running into issues where
dist.init_process_group hangs indefinitely when using the profiler. The script runs normally once I remove the profiler.
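Roughly the pattern I'm using (a stripped-down sketch; the model and data are placeholders, and I launch with torchrun so the rendezvous environment variables are set):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    # Hangs on this call when it runs inside the profiler context;
    # the same script completes normally with the profiler removed.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(10, 10).cuda(), device_ids=[local_rank])
    model(torch.randn(8, 10).cuda()).sum().backward()

dist.destroy_process_group()
```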