Hi. As I understand it, torch.distributed.barrier blocks each process until all the other processes have reached the same point. However, it does not seem to work on the cluster. I am using the NCCL backend. I have added some print statements for debugging.
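For comparison, the behavior I expect can be illustrated with the standard library's multiprocessing.Barrier (a stand-in for dist.barrier, not the NCCL implementation; the worker/run_demo names and the sleep are my own):

```python
import multiprocessing as mp
import time

def worker(rank, barrier, queue):
    if rank == 0:
        time.sleep(0.5)  # simulate rank 0 doing extra work (e.g. a data download)
    queue.put(f"rank {rank} reached barrier")
    barrier.wait()  # no process proceeds until all world_size processes arrive
    queue.put(f"rank {rank} passed barrier")

def run_demo(world_size=2):
    barrier = mp.Barrier(world_size)
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(r, barrier, queue))
             for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Every "reached" message is enqueued before any "passed" message,
    # because each put of "passed" happens only after barrier.wait returns.
    return [queue.get() for _ in range(2 * world_size)]
```

With a working barrier, all "reached barrier" messages appear before any "passed barrier" message, which is exactly the ordering I expected from dist.barrier in my output below.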
My code:
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def init_process(rank, size, backend='nccl'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)

def train(rank, world_size):
    print("inside train method")
    init_process(rank, world_size, "nccl")
    print(f"Rank {rank + 1}/{world_size} process initialized.")
    if rank == 0:
        get_dataloader(rank, world_size)
        model.gru()
    dist.barrier()
    print(f"Rank {rank + 1}/{world_size} training process passed data download barrier.")

def main():
    mp.spawn(train,
             args=(WORLD_SIZE,),
             nprocs=WORLD_SIZE,
             join=True)
Actual output:
inside train method
Rank 2/2 process initialized
Rank 1/2 training process passed data download barrier
starting epochs
Rank 1/2 process initialized
Rank 0/2 training process passed data download barrier
starting epochs
I expected the output to look like:
inside train method.
Rank 1/2 process initialized.
Rank 2/2 process initialized.
Rank 1/2 training process passed data download barrier.
Rank 2/2 training process passed data download barrier.
starting epochs
starting epochs.
From this it is clear that the first process did not wait for the second process. Am I missing something?