I’m trying to optimize the performance of distributed machine learning by assigning priorities to data packets. Is there a way to get the dependency graph for computation/communication operations? In TensorFlow, we can extract the prerequisites for each tensor, but PyTorch seems to build the computation graph at runtime, and there doesn’t appear to be an API for this.
That’s a great question! This is actually one item on our roadmap.
> Is there a way to get the dependency graph for computation/communication operations?
- It is possible. One option is to traverse the autograd graph from the output. This code shows how the DDP reducer currently implements it.
- Another option is to do something similar to APEX: log the order in which gradients become ready during the first iteration, and use that order to prioritize communication.
- Communication (async allreduce) operations are kicked off whenever a gradient bucket is ready.
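For the first option, here is a minimal sketch of walking the autograd graph from an output tensor via `grad_fn` / `next_functions`. The helper name `autograd_dependencies` is my own; this is not the actual DDP reducer code, just an illustration of the traversal idea.

```python
import torch

def autograd_dependencies(output):
    """Depth-first traversal of the autograd graph starting from
    output.grad_fn, collecting each backward node's successor nodes."""
    deps = {}   # node repr -> reprs of nodes that run after it in backward
    seen = set()
    queue = [output.grad_fn]
    while queue:
        fn = queue.pop()
        if fn is None or fn in seen:
            continue
        seen.add(fn)
        # next_functions holds (node, input_index) pairs; None entries
        # correspond to inputs that do not require grad
        nexts = [nxt for nxt, _ in fn.next_functions if nxt is not None]
        deps[str(fn)] = [str(n) for n in nexts]
        queue.extend(nexts)
    return deps

model = torch.nn.Linear(4, 2)
loss = model(torch.randn(3, 4)).sum()
graph = autograd_dependencies(loss)
```

Each key is a backward node (e.g. an `AddmmBackward0` or `AccumulateGrad` instance); the values give the dependency ordering you could use to schedule communication.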
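For the second option, a minimal sketch of logging the grad-ready order with per-parameter hooks (using `Tensor.register_hook`, which fires when that parameter's gradient is computed). The `ready_order` list and the two-layer model are just illustrative assumptions.

```python
import torch

ready_order = []  # parameter names in the order their grads become ready

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))

for name, param in model.named_parameters():
    # bind name via default arg; the hook appends the name and
    # returns the gradient unchanged
    param.register_hook(lambda grad, name=name: ready_order.append(name) or grad)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
```

After the first backward pass, `ready_order` reflects the reverse of the forward order (the last layer's gradients become ready first), which is exactly the order you would want to prioritize communication in.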
This is a brief description of how DDP works; you can start from there. This paper seems to have done something similar for PyTorch.
Do you mind if I ask about the scope of the project you are working on? The PyTorch Distributed team hasn’t started working on this yet, but we have some ideas about how to do it. If you plan to modify PyTorch code or open-source your solution, we can collaborate on this effort.
Thanks a lot for your reply! The scope of my work is similar to the paper TicTac: Accelerating Distributed Deep Learning with Communication Scheduling (https://arxiv.org/abs/1803.03288), which uses heuristics to pipeline communication and computation. Specifically, I want to design a new parameter-server algorithm that uses a set of distributed servers and pipelines communication/computation to alleviate the bottleneck.
For a parameter server, torchrpc (the torch.distributed.rpc package) might be helpful. It is currently an experimental feature, and various improvements are coming soon.
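As a minimal sketch of the RPC API, here is a single-process example (rank 0 of a world of size 1, calling into itself); a real parameter server would run `init_rpc` on multiple processes and have workers issue `rpc_async` calls against the server's parameter store. The address/port values are placeholder assumptions.

```python
import os
import torch
import torch.distributed.rpc as rpc

# placeholder rendezvous settings for a local, single-process run
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

rpc.init_rpc("ps", rank=0, world_size=1)

# asynchronous RPC: the caller can overlap local computation with this call,
# which is the pipelining opportunity for a parameter-server design
fut = rpc.rpc_async("ps", torch.add, args=(torch.ones(2), torch.ones(2)))
result = fut.wait()

rpc.shutdown()
```

The key point for your use case is that `rpc_async` returns a future immediately, so gradient pushes/pulls can be interleaved with backward computation.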