The time cost of torch.distributed.all_reduce across ranks is inconsistent

I am experimenting with torch.distributed.all_reduce and observe that the time overhead is inconsistent across machines. In particular, the rank 2 machine sees the smallest time overhead. My code is as follows:

import time
import torch
import torch.distributed
import adaptdl.env

# Assumes the default process group is already initialized and `device` is the local GPU.
def noop(net):
    # Build one flattened tensor with the same total size as the model's parameters.
    grad_list = []
    for param in net.parameters():
        grad_list.append((param + torch.randn(param.shape).to(device)).flatten())
    flatten_grad = torch.cat(grad_list)
    start = time.time()
    torch.distributed.all_reduce(flatten_grad, op=torch.distributed.ReduceOp.SUM)
    torch.cuda.synchronize()
    print('rank {}, after reduce sum grad {}, time cost {}'.format(
        adaptdl.env.replica_rank(), flatten_grad.sum(), time.time() - start))

def demo(net):
    for i in range(10):
        noop(net)
    exit(0)

The result is as follows:


rank 1, after reduce sum grad 19065.580078125, time cost 0.15841436386108398
rank 0, after reduce sum grad 19065.580078125, time cost 0.1609196662902832
rank 2, after reduce sum grad 19065.580078125, time cost 0.04276561737060547
rank 1, after reduce sum grad 19919.873046875, time cost 0.15205001831054688
rank 0, after reduce sum grad 19919.873046875, time cost 0.11464929580688477
rank 2, after reduce sum grad 19919.873046875, time cost 0.04297232627868652
rank 0, after reduce sum grad 24817.13671875, time cost 0.11996173858642578
rank 1, after reduce sum grad 24817.13671875, time cost 0.1443774700164795
rank 2, after reduce sum grad 24817.13671875, time cost 0.04287433624267578
rank 1, after reduce sum grad 30847.12109375, time cost 0.14579391479492188
rank 0, after reduce sum grad 30847.12109375, time cost 0.12775921821594238
rank 2, after reduce sum grad 30847.12109375, time cost 0.042836904525756836
rank 1, after reduce sum grad 2664.037109375, time cost 0.13935065269470215
rank 0, after reduce sum grad 2664.037109375, time cost 0.13114142417907715
rank 2, after reduce sum grad 2664.037109375, time cost 0.04304385185241699

Is this normal?
Are there any potential ways to address this issue?

To measure the time accurately, you probably need another torch.cuda.synchronize() before you launch the all_reduce, to ensure all previously enqueued GPU kernels have finished and only then start timing the all_reduce. Otherwise the measured interval also includes whatever work each rank still has queued on its stream, which can differ from rank to rank.
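Here is a minimal sketch of that timing pattern, assuming the same flatten_grad tensor and adaptdl.env.replica_rank() helper from your snippet (the function name is just for illustration):

def timed_all_reduce(flatten_grad):
    # Drain any previously queued kernels so the timer only covers the collective.
    torch.cuda.synchronize()
    start = time.time()
    torch.distributed.all_reduce(flatten_grad, op=torch.distributed.ReduceOp.SUM)
    # Wait for the all_reduce itself to finish before reading the clock.
    torch.cuda.synchronize()
    print('rank {}, all_reduce time {:.4f}s'.format(
        adaptdl.env.replica_rank(), time.time() - start))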

The other way to measure GPU time accurately is to use CUDA events: https://pytorch.org/docs/stable/generated/torch.cuda.Event.html
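For example, a rough sketch (the event variable names are just illustrative, and flatten_grad is the tensor from your code):

start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)
start_evt.record()
torch.distributed.all_reduce(flatten_grad, op=torch.distributed.ReduceOp.SUM)
end_evt.record()
torch.cuda.synchronize()  # make sure both events have completed before querying them
print('all_reduce took {:.3f} ms'.format(start_evt.elapsed_time(end_evt)))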
