Hi, I’m running an allreduce with torch.distributed on a cluster of 2 AWS P4d instances (2*8 A100 GPUs).
I’m launching the 16 processes with MPI, and the allreduce with that function:
def run(rank, local_rank):
""" Simple allreduce. """
tensor = torch.rand(int(1e6)*args.tensor_size_mm, dtype=torch.float32).to(cuda)
t1 = time.time()
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
dist.barrier()
t2 = time.time()
return tensor.element_size()*tensor.numel()/1e6, t2-t1
results = [] # average over N allreduces to remove jitter
for i in range(args.n_attempts):
size, duration = run(WORLD_RANK, LOCAL_RANK)
results.append(duration)
print("Run {} - allreduce of {}MB tensor done in {}s".format(
i, size, duration))
I run 5 allreduces one after the other to remove the impact of network jitter. Suprisingly, the measured allreduce duration is vastly different on every GPU! with differences as big as 5x ratio. Below a measurement I did with a 8GB tensor
Why does each GPU measures a different allreduce time? What is the proper way to measure the duration of an allreduce operation, if all processes report something different