Hi. As I understand it, torch.distributed.barrier blocks each process until all the other processes have reached the same point. However, it does not seem to work on the cluster. I am using the NCCL backend. I have added some print statements for debugging.
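For comparison, the behavior I expect can be illustrated with the standard library's multiprocessing.Barrier (a stand-in for dist.barrier, not the NCCL implementation; the worker/run_demo names and the sleep are my own):

```python
import multiprocessing as mp
import time

def worker(rank, barrier, queue):
    if rank == 0:
        time.sleep(0.5)  # simulate rank 0 doing extra work (e.g. a data download)
    queue.put(f"rank {rank} reached barrier")
    barrier.wait()  # no process proceeds until all world_size processes arrive
    queue.put(f"rank {rank} passed barrier")

def run_demo(world_size=2):
    barrier = mp.Barrier(world_size)
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(r, barrier, queue))
             for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Every "reached" message is enqueued before any "passed" message,
    # because each put of "passed" happens only after barrier.wait returns.
    return [queue.get() for _ in range(2 * world_size)]
```

With a working barrier, all "reached barrier" messages appear before any "passed barrier" message, which is exactly the ordering I expected from dist.barrier in my output below.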
My code:
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def init_process(rank, size, backend='nccl'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)

def train(rank, world_size):
    print("inside train method")
    init_process(rank, world_size, "nccl")
    print(f"Rank {rank + 1}/{world_size} process initialized.")
    if rank == 0:
        get_dataloader(rank, world_size)
        model.gru()
    dist.barrier()
    print(f"Rank {rank + 1}/{world_size} training process passed data download barrier.")

def main():
    mp.spawn(train,
             args=(WORLD_SIZE,),
             nprocs=WORLD_SIZE,
             join=True)
Actual output:
inside train method
Rank 2/2 process initialized
Rank 1/2 training process passed data download barrier
starting epochs
Rank 1/2 process initialized
Rank 0/2 training process passed data download barrier
starting epochs
I expected the output to look like:
inside train method.
Rank 1/2 process initialized.
Rank 2/2 process initialized.
Rank 1/2 training process passed data download barrier.
Rank 2/2 training process passed data download barrier.
starting epochs
starting epochs.
From this it is clear that the first process did not wait for the second process. Am I missing something?