Analysis of gradient transfer overhead in distributed training


Is there any way to analyze the gradient transfer process during distributed training in PyTorch? I want to predict the networking overhead for each mini-batch, given a known bandwidth.


Hi thanks for posting!

Since PyTorch 1.9, the PyTorch Profiler has shipped with a distributed training view, which can help you analyze the time and memory consumed by a training job. You can refer to the relevant section of the blog post “What’s New in PyTorch Profiler 1.9?” on the PyTorch site.

As @wanchaol mentioned, “What’s New in PyTorch Profiler 1.9?” will give you insight into the profiler's synchronization/communication overview.

That said, could you describe your use case a little more? What specifically are you predicting with regard to “networking overhead”? Do you mean per-worker communication latency?

Thank you very much.

Hi Rohan,

Thanks for your kind response. The networking overhead I mentioned is the total time spent aggregating the gradients. I am writing a scheduler on top of PyTorch that aims for the best efficiency (executing as many epochs as possible per unit time). With that information, I can determine how many workers I should run (possibly across hosts). So profiling the gradient aggregation overhead is important to me, since it helps me decide whether or not to create a new worker for a job.
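
For a first-order estimate before any profiling, here is a minimal sketch of the kind of model such a scheduler could start from. It assumes a bandwidth-optimal ring all-reduce, no overlap between communication and backward computation, and a fixed per-worker batch size; the actual backend may pick a different algorithm and DDP overlaps communication with compute, so treat the result as a rough upper bound to be calibrated against profiler measurements. All function and parameter names below are hypothetical and not part of PyTorch.

```python
import math


def ring_allreduce_seconds(grad_bytes, num_workers, bandwidth_bytes_per_s, latency_s=0.0):
    """Estimate the time of one gradient all-reduce.

    Assumption: ring all-reduce, i.e. a reduce-scatter followed by an
    all-gather, each taking (num_workers - 1) steps that move one chunk
    of grad_bytes / num_workers over the slowest link.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if num_workers == 1:
        return 0.0  # a single worker has nothing to aggregate
    steps = 2 * (num_workers - 1)
    chunk_bytes = grad_bytes / num_workers
    return steps * (chunk_bytes / bandwidth_bytes_per_s + latency_s)


def epochs_per_second(num_workers, compute_s_per_batch, batches_per_epoch,
                      grad_bytes, bandwidth_bytes_per_s, latency_s=0.0):
    """Estimate throughput, assuming communication is NOT overlapped with compute."""
    comm_s = ring_allreduce_seconds(grad_bytes, num_workers, bandwidth_bytes_per_s, latency_s)
    # With data parallelism, the batches of one epoch are split across workers.
    steps_per_epoch = math.ceil(batches_per_epoch / num_workers)
    return 1.0 / (steps_per_epoch * (compute_s_per_batch + comm_s))


if __name__ == "__main__":
    # Illustrative numbers only: 25M fp32 parameters (100 MB of gradients)
    # over a 10 Gbit/s link (1.25e9 bytes/s).
    grad_bytes = 25_000_000 * 4
    bandwidth = 1.25e9
    for n in (1, 2, 4, 8):
        comm = ring_allreduce_seconds(grad_bytes, n, bandwidth)
        eps = epochs_per_second(n, 0.05, 1000, grad_bytes, bandwidth)
        print(f"workers={n} comm={comm:.3f}s epochs/s={eps:.4f}")
```

Comparing `epochs_per_second` for `n` and `n + 1` workers gives a simple criterion for whether adding a worker is likely to pay off; the per-batch compute time and effective bandwidth would have to be measured on the real cluster.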