Hi,
I have a question about the architecture of distributed PyTorch.
When I run some examples, I see that worker A can send and receive directly to/from worker B.
Why do we need MASTER_PORT and MASTER_ADDR?
For the port, I can understand that workers need this number to recognize which other workers belong to the same program. However, I do not understand why we need MASTER_ADDR.
If it were a master-slave model, I would see no problem, since the master worker would manage all the work.
The reason is that when we implemented torch.distributed.rpc, we wanted to abstract out the comm layer and reuse whatever was available in torch.distributed. At that time, ProcessGroup was the only option we had, and it requires a rendezvous during initialization. The master port and address are needed for that rendezvous. Subsequent communications do not go through the master address.
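To make the rendezvous role concrete, here is a minimal sketch using the env:// initialization method with the gloo backend. For simplicity it uses world_size=1 (a real job would launch one process per rank, each with the same MASTER_ADDR/MASTER_PORT); the address and port values are arbitrary placeholders.

```python
import os
import torch
import torch.distributed as dist

# MASTER_ADDR/MASTER_PORT are only used for the initial rendezvous:
# every rank connects to this address to discover the other ranks.
os.environ["MASTER_ADDR"] = "127.0.0.1"  # placeholder; any reachable host
os.environ["MASTER_PORT"] = "29500"      # placeholder; any free port

# Single-process group just for illustration.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.tensor([1.0, 2.0])
dist.all_reduce(t)  # with world_size=1 this sum is a no-op

# After init, collectives and send/recv go peer-to-peer,
# not through the master address.
print(t.tolist())

dist.destroy_process_group()
```

With more than one rank, every process would call init_process_group with the same MASTER_ADDR/MASTER_PORT but its own rank; only that first handshake touches the master address.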
After the rendezvous, is there a way to restrict connections between specific peers in the p2p setup? For example, with 3 peers, can I restrict communication to p1-p2 and p1-p3 exclusively? As I see it, the process automatically opens ephemeral TCP ports to talk to all peers, even though I programmatically restricted data communication between p2 and p3. Your response is greatly appreciated.