Question
I want to analyze the speed of distributed data parallel (DDP) training of AlexNet. However, I find that the allreduce invoked during AlexNet training is slower than the same allreduce invoked directly via nccl-tests or a standalone PyTorch allreduce test.
Environment
- PyTorch 1.5 + CUDA 9.0 + NCCL 2.7.8
- 4x GTX 1080 Ti
Details of Performance
AlexNet has about 244 MB of parameters (roughly 61M float32 parameters x 4 bytes), so I test the allreduce performance for this message size with 4 devices.
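For reference, the 244 MB figure matches the torchvision definition of AlexNet (a quick sketch to check it, assuming float32 parameters and decimal megabytes):

```python
import torchvision

# Count AlexNet's parameters and convert to bytes (float32 = 4 bytes each).
model = torchvision.models.alexnet()
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params} parameters, {num_params * 4 / 1e6:.1f} MB")
# -> 61100840 parameters, 244.4 MB
```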
Speed of NCCL allreduce in a direct test
I use nccl-tests to benchmark allreduce. The time cost is about 50.5 ms.
I also test it with the following PyTorch code. The time cost is about 50.4 ms.
import time

import torch
import torch.distributed as dist


def all_reduce_latency(nbytes):
    # Allocate a float32 buffer of the requested size (4 bytes per element).
    buf = torch.randn(nbytes // 4).cuda()
    torch.cuda.synchronize()

    # Warmup iterations.
    for _ in range(5):
        dist.all_reduce(buf)
        torch.cuda.synchronize()

    # Timed iterations.
    torch.cuda.synchronize()
    begin = time.perf_counter()
    for _ in range(25):
        dist.all_reduce(buf)
        torch.cuda.synchronize()
    end = time.perf_counter()

    # Average latency per allreduce, in microseconds.
    avg_latency = (end - begin) * 1e6 / 25
    return avg_latency
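For context, this is roughly how I launch the benchmark: one process per GPU, with the process group initialized from environment variables. The launcher details below (LOCAL_RANK, the launch command in the comment) are just one common way to do it, not necessarily exactly what matters here:

```python
import os

import torch
import torch.distributed as dist


def main():
    # One process per GPU, launched with e.g.
    #   python -m torch.distributed.launch --use_env --nproc_per_node=4 bench.py
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    nbytes = 244 * 1000 * 1000  # roughly AlexNet's gradient size
    latency_us = all_reduce_latency(nbytes)
    if dist.get_rank() == 0:
        print(f"avg allreduce latency: {latency_us / 1000:.1f} ms")


if __name__ == "__main__":
    main()
```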
Speed of NCCL allreduce during AlexNet training
I use DDP with a very large bucket size (setup sketched below) so that all gradients are fused into a single bucket and gradient communication is not overlapped with computation. In this setting the allreduce takes about 67.2 ms, which is much slower than the standalone test.
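The DDP setup looks roughly like this (a sketch: `local_rank` is the usual per-process GPU index, and 512 MB is just an arbitrary bucket cap larger than the ~244 MB of gradients):

```python
import torch
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

model = torchvision.models.alexnet().cuda()
# A bucket cap larger than the whole model (~244 MB of gradients) forces DDP
# to fuse all gradients into a single bucket, i.e. one 244 MB allreduce that
# only starts after the backward pass has produced every gradient.
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=512)
```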
I also tried manually fusing the gradients into a single buffer and calling allreduce myself instead of using DDP (sketched below), and there was no difference: NCCL allreduce is still much slower when called during AlexNet training. I don't know what causes this. Can you give me some suggestions for tracking down the problem?
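For completeness, the manual fusion is roughly the following sketch (`fused_grad_allreduce` is just an illustrative name; the call is timed the same way as `all_reduce_latency` above):

```python
import torch
import torch.distributed as dist


def fused_grad_allreduce(model):
    # Flatten all gradients into one contiguous buffer, allreduce it once,
    # then copy the averaged values back into the individual .grad tensors.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    flat = torch.cat([g.reshape(-1) for g in grads])
    dist.all_reduce(flat)
    flat.div_(dist.get_world_size())
    offset = 0
    for g in grads:
        g.copy_(flat[offset:offset + g.numel()].view_as(g))
        offset += g.numel()
```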
Thanks a lot!