Hi,
Is there any method to analyze the transfer process of distributed training in PyTorch? I want to predict the networking overhead for each mini-batch, given a known bandwidth.
Thanks
Hi @zizhao.mo, thanks for posting!
Since PyTorch 1.9, we have shipped the PyTorch Profiler with a distributed training view, which can help you analyze the time and memory consumed by a training job. You can refer to this section: What’s New in PyTorch Profiler 1.9? | PyTorch
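For reference, here is a minimal sketch of capturing the communication kernels with torch.profiler. It assumes `dist.init_process_group()` has already been called, `model` is wrapped in DistributedDataParallel, and `inputs`, `targets`, and `loss_fn` stand in for your own mini-batch and loss:

```python
from torch.profiler import profile, ProfilerActivity

# Assumed setup (not shown): process group initialized, `model` wrapped
# in DistributedDataParallel, `inputs`/`targets`/`loss_fn` defined.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = loss_fn(model(inputs), targets)
    loss.backward()  # DDP all-reduces gradients during the backward pass

# Communication kernels (e.g. nccl:all_reduce) show up in this table;
# the TensorBoard distributed view breaks them down further per worker.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```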
As @wanchaol mentioned, What’s New in PyTorch Profiler 1.9? | PyTorch will give you insights into the synchronization/communication overview.
That said, could you describe your use case a little more? What specifically are you predicting with regard to “networking overhead”? Do you mean per-worker communication latency?
Thank you very much.
Hi Rohan,
Thanks for your kind response. The networking overhead I mentioned is the total time spent aggregating the gradients. I am writing a scheduler on top of PyTorch to achieve the best efficiency (execute as many epochs as possible per unit of time), which means deciding how many workers to run (possibly across hosts). Profiling the gradient-aggregation overhead is therefore important for me, since it helps me decide whether or not to create a new worker for a job.
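For what it's worth, here is a rough sketch of how the gradient-aggregation time could be estimated from model size and bandwidth, assuming a ring all-reduce (NCCL-style behavior). `estimate_allreduce_seconds` is a hypothetical helper for illustration, not a PyTorch API; real overhead also depends on latency, gradient bucketing, and how much communication overlaps with the backward pass:

```python
def estimate_allreduce_seconds(model, num_workers, bandwidth_gbps):
    """Back-of-the-envelope ring all-reduce estimate (hypothetical helper).

    In a ring all-reduce, each worker sends and receives roughly
    2 * (N - 1) / N of the total gradient bytes.
    """
    grad_bytes = sum(p.numel() * p.element_size()
                     for p in model.parameters() if p.requires_grad)
    payload_bytes = 2 * (num_workers - 1) / num_workers * grad_bytes
    return payload_bytes / (bandwidth_gbps * 1e9 / 8)  # Gb/s -> bytes/s
```

For example, a 100M-parameter fp32 model (400 MB of gradients) on 4 workers at 10 Gb/s gives roughly 2 * (3/4) * 400 MB / 1.25 GB/s ≈ 0.48 s of aggregation per mini-batch, before accounting for overlap with compute.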