Question about torch.distributed p2p communication

Hi,
I have a question about p2p communication in torch.distributed. Suppose we set up a group of 3 processes by calling init_process_group(backend="gloo", init_method="tcp://10.0.0.1:8888", rank=args.rank, world_size=3) on three different nodes with IPs 10.0.0.1 to 10.0.0.3. When we send tensors from 10.0.0.2 to 10.0.0.3, how is the underlying network traffic routed? Does it go directly from 10.0.0.2 to 10.0.0.3, or from 10.0.0.2 to 10.0.0.1 and then on to 10.0.0.3? The answer is probably obvious, but I couldn't find it in the docs. Thanks in advance!

Yijing

Hey @yijing

The message is sent directly from 10.0.0.2 to 10.0.0.3.

In init_process_group, init_method="tcp://10.0.0.1:8888" is only used for rendezvous, i.e., all processes use the same ip:port to find each other. After that, communication doesn't need to go through the master.
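To make this concrete, here is a minimal sketch of the setup from the question. The `run` helper, the tensor shape, and the value 42 are just for illustration (they are not from the original post); the address and ranks follow the question, with rank 1 on 10.0.0.2 and rank 2 on 10.0.0.3.

```python
import torch
import torch.distributed as dist


def run(rank, world_size=3, init_method="tcp://10.0.0.1:8888"):
    """Hypothetical helper: rank 1 sends a tensor to rank 2."""
    # Rendezvous only: every rank contacts the same ip:port to find its peers.
    dist.init_process_group(
        backend="gloo",
        init_method=init_method,
        rank=rank,
        world_size=world_size,
    )

    tensor = torch.zeros(4)
    if rank == 1:
        tensor += 42
        dist.send(tensor, dst=2)   # point-to-point send to rank 2
    elif rank == 2:
        dist.recv(tensor, src=1)   # point-to-point receive from rank 1
    # Rank 0 takes no part in the send/recv above.

    # Collective barrier so that no rank (in particular rank 0, which
    # hosts the rendezvous store) tears down while others are still busy.
    dist.barrier()
    dist.destroy_process_group()
    return tensor
```

Each node would call `run(args.rank)` with its own rank. After the call, rank 2 should hold the tensor that rank 1 sent, while rank 0 still has its zeros.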

BTW, if you are using p2p communication, torch.distributed.rpc might be useful too. Here is a tutorial.
