I am trying to profile an application that uses the DistributedDataParallel module.
Is there a specific set of guidelines for measuring the communication overheads (allreduce time, broadcast time, etc.)?
I wrapped the training step in
`with torch.autograd.profiler.profile(use_cuda=True):`, but I didn't get any information about these calls. It seems to track only the basic operator calls, not collectives like allreduce or broadcast that happen in the ProcessGroup (NCCL) layer.
Please correct me if I am wrong.