dist.broadcast and dist.send do not work for more than 2 ranks

I’m running a piece of code (attached below) that closely follows the tutorial here: Writing Distributed Applications with PyTorch — PyTorch Tutorials 1.9.0+cu102 documentation - in particular, the section about sending a tensor that is incremented by 1. However, instead of only 2 ranks I am using 5, and instead of using send, I am broadcasting the incremented tensor from the source (rank 0) to the remaining ranks. The script doesn’t terminate (note - I have also tried using send to each of the other ranks, and that hasn’t worked either). I’m fairly sure the line that calls dist.broadcast is the source of the issue, since the script never gets past it. I’m not sure whether this is due to some fundamental misunderstanding (I’m new to distributed training), so I’d appreciate it if someone could shed some light on this.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, size):
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
        # Broadcast the tensor to the other processes
        dist.broadcast(tensor=tensor, src=0)
    else:
        # Receive tensor from process 0
        dist.recv(tensor=tensor, src=0)
    print('Rank ', rank, ' has data ', tensor[0])

def init_process(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = ''
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 5
    processes = []
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

You need to call dist.broadcast() instead of dist.recv() in the receiving ranks. broadcast is a collective operation, so every rank in the process group must call it; since only rank 0 calls it here, the call blocks forever waiting for the other ranks, which is why the script hangs.
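
To illustrate, here is a minimal corrected sketch of the script above, with all ranks calling dist.broadcast. It assumes a single-machine run, so MASTER_ADDR is set to 127.0.0.1 (the original post left the address blank); adjust for your setup:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, size):
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
    # broadcast is collective: every rank calls it.
    # Rank 0 (src) sends its tensor; all other ranks receive into theirs.
    dist.broadcast(tensor=tensor, src=0)
    print('Rank', rank, 'has data', tensor[0].item())

def init_process(rank, size, fn, backend='gloo', port='29500'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'  # assumption: single machine
    os.environ['MASTER_PORT'] = port
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)
    dist.destroy_process_group()

if __name__ == "__main__":
    size = 5
    processes = []
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

With this version every rank reaches the collective call, so all 5 ranks print a tensor value of 1.0 and the script terminates.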

Oh I see, thanks for the clarification!