I have a question about the architecture of distributed PyTorch!
When I run some examples, I see that worker A can send and receive tensors directly to/from worker B.
Why do we need MASTER_PORT and MASTER_ADDR?
For the port, I can understand that workers need this number to recognize whether other workers belong to the same program or not. However, I do not understand why we need MASTER_ADDR.
If it were a master-slave model, that would make sense, and the master worker would manage all the work.
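To make the question concrete, here is a minimal sketch (names and port number are illustrative) of a point-to-point send/recv: even though the tensor travels directly from worker 0 to worker 1, every worker must still set MASTER_ADDR and MASTER_PORT before `init_process_group`:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Every worker, not only rank 0, must know the master's address:
    # it is the rendezvous point used during initialization.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    t = torch.zeros(1)
    if rank == 0:
        t += 42
        dist.send(t, dst=1)   # direct P2P send to worker 1
    else:
        dist.recv(t, src=0)   # direct P2P recv from worker 0
        print(f"rank {rank} received {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

After initialization, `send`/`recv` go directly between the two workers; the master address only served the initial rendezvous.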
The reason is that when we implemented torch.distributed.rpc, we wanted to abstract out the comm layer and reuse whatever was available in torch.distributed. At that time, ProcessGroup was the only option we had, and it requires a rendezvous during initialization. The master port and address are needed for that rendezvous. Subsequent communications do not go through the master address.
As of v1.6.0, we added a new P2P comm backend implementation, https://pytorch.org/docs/master/rpc.html#tensorpipe-backend. And we do plan to remove the rendezvous requirement in a future release.
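A sketch of the RPC API with the TensorPipe backend (worker names and the port are illustrative): today it still reads MASTER_ADDR/MASTER_PORT for the initial rendezvous, but the RPC payloads themselves travel peer-to-peer:

```python
import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def add(a, b):
    return a + b

def worker(rank, world_size):
    # Still required for the rendezvous during init_rpc.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    rpc.init_rpc(
        name=f"worker{rank}",
        rank=rank,
        world_size=world_size,
        backend=rpc.BackendType.TENSORPIPE,  # default backend since v1.6.0
    )
    if rank == 0:
        # Synchronous RPC directly to worker1; the tensor payload does
        # not route through the master address.
        result = rpc.rpc_sync("worker1", add,
                              args=(torch.ones(1), torch.ones(1)))
        print(result)
    rpc.shutdown()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```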