dist.broadcast and dist.send do not work for more than 2 ranks

I’m running a piece of code (attached below) that closely follows the tutorial here: Writing Distributed Applications with PyTorch — PyTorch Tutorials 1.9.0+cu102 documentation - in particular, the section about sending a tensor that is incremented by 1. However, instead of only 2 ranks I am using 5, and instead of using send, I am broadcasting the incremented tensor from the source (rank 0) to the remaining ranks. The script doesn’t terminate (note - I have also tried using send to each of the other ranks, and that hasn’t worked either). I’m fairly sure the line that calls dist.broadcast is the source of the issue, since the script never gets past it. I’m not sure whether this is due to some fundamental misunderstanding (I’m new to distributed training), so I’d appreciate it if someone could shed some light on this.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, size):
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
        # Broadcast the tensor to the other processes
        dist.broadcast(tensor=tensor, src=0)
    else:
        # Receive tensor from process 0
        dist.recv(tensor=tensor, src=0)
    print('Rank ', rank, ' has data ', tensor[0])

def init_process(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = ''
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 5
    processes = []
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

You need to call dist.broadcast() instead of dist.recv() in the receiving ranks. broadcast is a collective operation, so every rank in the process group must call it; since only rank 0 calls it here, the call blocks forever waiting for the other ranks, which is why the script hangs.
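
To illustrate, here is a minimal corrected sketch of the script above, with all ranks calling dist.broadcast. It assumes a single-machine run, so MASTER_ADDR is set to 127.0.0.1 (the original post left the address blank); adjust for your setup:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, size):
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
    # broadcast is collective: every rank calls it.
    # Rank 0 (src) sends its tensor; all other ranks receive into theirs.
    dist.broadcast(tensor=tensor, src=0)
    print('Rank', rank, 'has data', tensor[0].item())

def init_process(rank, size, fn, backend='gloo', port='29500'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'  # assumption: single machine
    os.environ['MASTER_PORT'] = port
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)
    dist.destroy_process_group()

if __name__ == "__main__":
    size = 5
    processes = []
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

With this version every rank reaches the collective call, so all 5 ranks print a tensor value of 1.0 and the script terminates.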

Oh I see, thanks for the clarification!