Why each GPU measures a difference allreduce duration?

Olivier-CR · November 9, 2022, 5:28pm

Hi, I’m running an allreduce with torch.distributed on a cluster of 2 AWS P4d instances (2*8 A100 GPUs).

I’m launching the 16 processes with MPI, and the allreduce with that function:

def run(rank, local_rank):
        """ Simple allreduce. """
        tensor = torch.rand(int(1e6)*args.tensor_size_mm, dtype=torch.float32).to(cuda)
        t1 = time.time()
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        dist.barrier()        
        t2 = time.time()

        return tensor.element_size()*tensor.numel()/1e6, t2-t1

results = []  # average over N allreduces to remove jitter
for i in range(args.n_attempts):
    size, duration = run(WORLD_RANK, LOCAL_RANK)
    results.append(duration)
    print("Run {} - allreduce of {}MB tensor done in {}s".format(
        i, size, duration))

I run 5 allreduces one after the other to remove the impact of network jitter. Suprisingly, the measured allreduce duration is vastly different on every GPU! with differences as big as 5x ratio. Below a measurement I did with a 8GB tensor

Why does each GPU measures a different allreduce time? What is the proper way to measure the duration of an allreduce operation, if all processes report something different

cbalioglu · November 9, 2022, 5:57pm

@kwen2501 any thoughts on this?

Olivier-CR · November 9, 2022, 6:34pm

It was presumably because each rank was starting at slightly different times. When I add an extra barrier, the allreduce durations are identical to the millisecond

def run(rank, local_rank):
    """ Simple allreduce. """
    coeffs = int(1e6*args.tensor_size_mb*8/32)
    tensor = torch.rand(coeffs, dtype=torch.float32).to(cuda)
    dist.barrier()  # HERE extra barrier before the allreduce
    t1 = time.time()
    print("rank {} started allreduce at time {}".format(WORLD_RANK, t1))
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    dist.barrier()        
    t2 = time.time()
    print("rank {} finished allreduce at time {}".format(WORLD_RANK, t2))

kwen2501 · November 9, 2022, 10:48pm

Thanks for finding the solution. Yes, it would be better to synchronize the processes before entering the timing region.