Inconsistent multi-node latency with NCCL

Hi,

I deployed PyTorch on 2 servers (with 1 GPU each), and I am trying to measure the communication latency using the following code, which simply executes the AllReduce operation multiple times and calculates the average time spent.

import time

import torch
import torch.distributed as dist


def run(vector_size, rank, steps):
    elapsedTime = 0
    for step in range(1, steps + 1):
        # Random integer tensor on the GPU, reduced across both nodes
        tensor = torch.randint(10, (vector_size,)).cuda()
        start = time.monotonic_ns()
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, async_op=False)
        latency = (time.monotonic_ns() - start) / 1e3  # ns -> us
        elapsedTime += latency
        # time.sleep(0.1)

    elapsedTime /= steps
    print(vector_size * 4, elapsedTime)  # reported size (B) and average latency (us)
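
For completeness, the process group behind this snippet is initialized with the NCCL backend roughly as follows (a simplified sketch; the rendezvous address and port are placeholders, not our actual configuration):

import torch.distributed as dist

# Simplified setup sketch for the 2-node run; the address/port below are
# placeholders for our actual rendezvous configuration.
def init(rank):
    dist.init_process_group(
        backend="nccl",                      # "gloo" was used for the Gloo comparison below
        init_method="tcp://10.0.0.1:29500",  # placeholder address/port
        rank=rank,                           # 0 or 1
        world_size=2,                        # 2 servers, 1 GPU each
    )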

I found the measured latency to be abnormally high with PyTorch 1.10 + NCCL 2.10:

size(B)      latency(us)
8            826.94
16           908.84
32           1479.80
64           2279.28
128          504.11
256          1348.66
512          1123.61
1024         2590.22
2048         1715.06
4096         5227.42
8192         3131.41
16384        3009.81
32768        1614.21
65536        6010.79
131072       6169.71
262144       6595.73
524288       4651.93
1048576      5800.94
2097152      7393.04

However, if I add time.sleep(0.1) at the end of each iteration, the latency becomes much smaller:

size(B)      latency(us)
8            153.83
16           157.38
32           157.01
64           157.73
128          140.99
256          130.14
512          107.28
1024         117.74
2048         87.42
4096         86.94
8192         110.08
16384        116.91
32768        224.90
65536        113.87
131072       409.87
262144       370.23
524288       837.10
1048576      1105.73
2097152      3323.84

The inconsistency also appears across different NCCL versions. I recompiled PyTorch against the official build of NCCL 2.11, which gives similar results. However, with NCCL 2.7, the latency is always small regardless of the interval. The interval does not affect the latency of the Gloo backend, either.
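
In case it matters, this is how I checked which NCCL version each PyTorch build is actually linked against (newer PyTorch releases return a (major, minor, patch) tuple):

import torch

# Print the PyTorch version and the NCCL version it was built with
print("PyTorch:", torch.__version__)
print("NCCL:", torch.cuda.nccl.version())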

What might be the reason for these different latency values? And what is the correct way to measure the performance of AllReduce operations in PyTorch? Thanks!
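
In particular, if I understand correctly that dist.all_reduce with the NCCL backend mainly enqueues a kernel on a CUDA stream, should the loop synchronize the device before reading the clock? Something like this sketch:

import time

import torch
import torch.distributed as dist


def run_synced(vector_size, rank, steps):
    elapsedTime = 0
    for step in range(1, steps + 1):
        tensor = torch.randint(10, (vector_size,)).cuda()
        torch.cuda.synchronize()  # make sure the host-to-device copy has finished
        start = time.monotonic_ns()
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, async_op=False)
        torch.cuda.synchronize()  # wait for the NCCL kernel to complete before timing
        elapsedTime += (time.monotonic_ns() - start) / 1e3  # ns -> us

    print(vector_size * 4, elapsedTime / steps)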

Some other system information:

OS: Ubuntu 20.04

GPU: Tesla V100 (No GPUDirect support)

Network Interface: Mellanox mlx5 (We use RoCEv2 for NCCL)

Thanks @Qiaofeng! Do you mind opening a GitHub issue for this? Looks like it needs some investigation on our end.

Thank you for the reply! We also found that the MPI environment might have some impact on this problem. I have opened a GitHub issue with a more detailed description.
