Inconsistent multi-node latency with NCCL
Hi,
I deployed PyTorch on 2 servers (1 GPU each), and I am trying to measure communication latency with the following code, which simply executes an AllReduce operation many times and reports the average time per call.
```python
import time

import torch
import torch.distributed as dist

def run(vector_size, rank, steps):
    elapsed_time = 0
    for step in range(1, steps + 1):
        tensor = torch.randint(10, (vector_size,)).cuda()
        start = time.monotonic_ns()
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, async_op=False)
        latency = (time.monotonic_ns() - start) / 1e3  # ns -> us
        elapsed_time += latency
        # time.sleep(0.1)
    elapsed_time /= steps
    # Printed size assumes 4-byte elements; note torch.randint defaults to int64.
    print(vector_size * 4, elapsed_time)
```
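For context, each process first joins a process group in the usual way. A minimal sketch of the setup (the address and port below are placeholders, and the Gloo fallback is only there so the snippet can run on a machine without a GPU; our actual runs use the NCCL backend across the two servers):

```python
import os

import torch
import torch.distributed as dist

# Placeholder rendezvous settings; in the real 2-node run, MASTER_ADDR points
# at the first server and RANK/WORLD_SIZE come from the launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# backend="nccl" in the actual experiment; "gloo" lets this run CPU-only.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

t = torch.ones(4)
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(t.tolist())  # with world_size == 1 this stays [1.0, 1.0, 1.0, 1.0]
dist.destroy_process_group()
```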
I found the measured latency abnormally high with PyTorch 1.10 + NCCL 2.10:
size(B) latency(us)
8 826.9433000000001
16 908.8419
32 1479.80385
64 2279.2819499999996
128 504.1064
256 1348.6622499999999
512 1123.6129000000003
1024 2590.2159
2048 1715.0593000000001
4096 5227.415999999999
8192 3131.40595
16384 3009.81275
32768 1614.2130499999998
65536 6010.794950000001
131072 6169.70775
262144 6595.7269
524288 4651.931450000001
1048576 5800.938
2097152 7393.041899999999
However, if I uncomment the time.sleep(0.1) at the end of each iteration, the latency becomes much smaller:
size(B) latency(us)
8 153.83099999999996
16 157.3773
32 157.008
64 157.7295
128 140.99030000000002
256 130.14204999999998
512 107.28104999999998
1024 117.73960000000002
2048 87.42374999999997
4096 86.94415000000002
8192 110.07860000000001
16384 116.90845000000002
32768 224.9045
65536 113.87135
131072 409.87255000000016
262144 370.2254
524288 837.1048000000001
1048576 1105.72925
2097152 3323.8366499999997
The inconsistency also appears across NCCL versions. I recompiled PyTorch against the official NCCL 2.11 build and got similar results, whereas with NCCL 2.7 the latency stays small regardless of the interval. The interval also has no effect on latency with the Gloo backend.
What might be causing these different latency values? And what is the correct way to measure the performance of AllReduce operations in PyTorch? Thanks!
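Since NCCL launches its kernels asynchronously with respect to the host, I also wondered whether the host clock alone is even the right tool here. This is a sketch of an event-based timer I considered (the function name and structure are my own, not from any official benchmark; the wall-clock branch is just a fallback for CPU tensors):

```python
import time

import torch
import torch.distributed as dist

def timed_all_reduce(tensor, iters=20):
    """Return the average all_reduce latency in microseconds.

    For GPU tensors this uses CUDA events, since NCCL kernels complete
    asynchronously and a host-side clock may stop too early (or measure
    queueing effects instead). For CPU tensors it falls back to a wall clock.
    """
    if tensor.is_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()  # drain any pending work before timing
        start.record()
        for _ in range(iters):
            dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        end.record()
        torch.cuda.synchronize()  # wait for the timed kernels to finish
        return start.elapsed_time(end) * 1e3 / iters  # ms -> us
    start_ns = time.monotonic_ns()
    for _ in range(iters):
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return (time.monotonic_ns() - start_ns) / 1e3 / iters
```

I am not sure this is the recommended methodology either, so pointers on the proper warm-up and synchronization discipline would be appreciated.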
Some other system information:
OS: Ubuntu 20.04
GPU: Tesla V100 (No GPUDirect support)
Network Interface: Mellanox mlx5 (We use RoCEv2 for NCCL)