Profiling Distributed Data Parallel Applications


I am trying to profile an application using DistributedDataParallel Module.

Is there a specific set of guidelines for measuring the communication overheads (allreduce time, broadcast time, etc.)?

I used with torch.autograd.profiler.profile(use_cuda=True), but I didn't get information about these calls. It may only track basic operator calls, not collectives like allreduce or broadcast happening in the ProcessGroup (NCCL) layer.

Please correct me if I am wrong.

Thank You,

Hey @Vibhatha_Abeykoon, DDP does not work with the autograd profiler yet, but this is on our roadmap. Until then, would nvprof be able to serve your use case?


@mrshenli Sorry for the late response. Yes, it could also be useful. I will check.
Thank You.

Profiling with DDP is enabled now, i.e., the collectives will be profiled. You can simply run the DDP model under the profiler as normal:

with torch.profiler.profile():
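To make this concrete, here is a minimal sketch of profiling a DDP backward pass, where the gradient allreduce happens. It uses a single-process "gloo" process group purely for illustration; in real training each rank would run this with its own rank, world size, and (typically) the "nccl" backend, and the collective events would then show up in the profiler output alongside the operator events.

```python
# Minimal sketch: profiling DDP collectives with torch.profiler.
# Assumption: single-process "gloo" group for illustration only; in a real
# job each rank runs this with its own rank/world_size (and usually NCCL).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def profile_ddp_step():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = DDP(torch.nn.Linear(8, 4))
    inp = torch.randn(16, 8)

    with torch.profiler.profile() as prof:
        loss = model(inp).sum()
        loss.backward()  # triggers the gradient allreduce inside DDP

    dist.destroy_process_group()
    return prof


if __name__ == "__main__":
    prof = profile_ddp_step()
    # The aggregated table includes the collective ops recorded by DDP.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

With use_cuda / CUDA activities enabled and the NCCL backend, the same pattern also surfaces the GPU-side timing of the collectives.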