I have a question about the architecture of distributed PyTorch!
When I run some examples, I see that worker A can send and receive tensors directly to/from worker B.
Why do we need MASTER_PORT and MASTER_ADDR?
For the port, I can understand that workers need this number to recognize whether other workers belong to the same program or not. However, I do not understand why we need MASTER_ADDR.
If it were a master-slave model, that would make sense, and the master worker would manage all the work.
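To make the question concrete, here is a minimal sketch (names and port number are illustrative) of a point-to-point send/recv: even though the tensor travels directly from worker 0 to worker 1, every worker must still set MASTER_ADDR and MASTER_PORT before `init_process_group`:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Every worker, not only rank 0, must know the master's address:
    # it is the rendezvous point used during initialization.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    t = torch.zeros(1)
    if rank == 0:
        t += 42
        dist.send(t, dst=1)   # direct P2P send to worker 1
    else:
        dist.recv(t, src=0)   # direct P2P recv from worker 0
        print(f"rank {rank} received {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

After initialization, `send`/`recv` go directly between the two workers; the master address only served the initial rendezvous.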
The reason is that when we implemented torch.distributed.rpc, we wanted to abstract out the comm layer and reuse whatever was available in torch.distributed. At that time, ProcessGroup was the only option we had, and it requires a rendezvous during initialization. The master port and address are needed for that rendezvous. Subsequent communications do not go through the master address.
As of v1.6.0, we added a new P2P comm backend implementation, https://pytorch.org/docs/master/rpc.html#tensorpipe-backend. And we do plan to remove the rendezvous requirement in a future release.
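A sketch of the RPC API with the TensorPipe backend (worker names and the port are illustrative): today it still reads MASTER_ADDR/MASTER_PORT for the initial rendezvous, but the RPC payloads themselves travel peer-to-peer:

```python
import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def add(a, b):
    return a + b

def worker(rank, world_size):
    # Still required for the rendezvous during init_rpc.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    rpc.init_rpc(
        name=f"worker{rank}",
        rank=rank,
        world_size=world_size,
        backend=rpc.BackendType.TENSORPIPE,  # default backend since v1.6.0
    )
    if rank == 0:
        # Synchronous RPC directly to worker1; the tensor payload does
        # not route through the master address.
        result = rpc.rpc_sync("worker1", add,
                              args=(torch.ones(1), torch.ones(1)))
        print(result)
    rpc.shutdown()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```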