Hi, I am using PyTorch for multi-node, multi-GPU training. I train a model with just 1M parameters on 128 GPUs. However, I found that the gradient all_reduce operation takes roughly half of the step time! Specifically, when I wrap the backward pass in model.no_sync() to disable gradient synchronization, the training throughput roughly doubles. I also profiled the code and confirmed that a cudaStreamSynchronize call accounts for about half of the time.
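For context, this is roughly how I disabled the sync when measuring (a minimal sketch with placeholder names such as model, criterion, loader, and optimizer; model is the DistributedDataParallel wrapper):

```python
# Minimal sketch (placeholder names): wrapping forward/backward in DDP's
# no_sync() context skips the gradient all_reduce for that step.
for inputs, targets in loader:
    optimizer.zero_grad(set_to_none=True)
    with model.no_sync():                        # no gradient all_reduce
        loss = criterion(model(inputs), targets)
        loss.backward()
    optimizer.step()                             # gradients stay local to each rank
```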
I'd like to ask whether there is any method that could reduce this heavy communication cost. Can anyone help me? Thank you!
cudaStreamSynchronize is called because I use torch.amp, whose GradScaler checks whether the gradients contain NaN/Inf values and therefore needs a grad.item()-style device-to-host read. I don't know the inner implementation of DDP, but in the profile I can see that nccl_all_reduce overlaps with cudaStreamSynchronize. So I think cudaStreamSynchronize takes so long because it is waiting for the slow communication between the GPUs to finish.
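To make it concrete, my training step looks roughly like the sketch below (all names are placeholders, the real model and data are simplified). As far as I can tell, the wait shows up around scaler.step(), which has to read the inf/NaN flag back to the host and therefore blocks until the backward work, including the overlapped all_reduce, has finished:

```python
# Minimal sketch of the AMP + DDP step I am describing (placeholder names).
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()  # DDP launches the gradient all_reduce here
    scaler.step(optimizer)         # reads the inf/NaN flag -> blocks on the stream
    scaler.update()
```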
To check the all_reduce performance of your system for different message sizes, you could use nccl-tests as a reference. If the PyTorch communication calls take significantly longer than those results, NCCL might be waiting on other work and you would need to dig deeper into the profile.
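As a rough cross-check from the PyTorch side, you could also time a bare all_reduce on a tensor of about your gradient size and compare it with the nccl-tests numbers. The sketch below assumes a torchrun launch and ~1M fp32 gradient elements (about 4 MB); adjust the size and iteration counts to your setup:

```python
# Rough benchmark sketch, assuming a `torchrun --nproc_per_node=<gpus> bench.py` launch.
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# ~1M fp32 elements, roughly the size of the flattened gradients.
x = torch.randn(1_000_000, device="cuda")

# Warm up NCCL before timing.
for _ in range(5):
    dist.all_reduce(x)
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(20):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / 20

if dist.get_rank() == 0:
    print(f"avg all_reduce time for ~4 MB: {elapsed * 1e3:.3f} ms")

dist.destroy_process_group()
```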