Hi, I am using PyTorch for multi-node, multi-GPU training. I train a model with just 1M parameters on 128 GPUs. However, I found that the gradient all_reduce operation takes roughly half of the step time! Specifically, when I wrap the backward pass in model.no_sync() to disable gradient synchronization, the training throughput roughly doubles. I also profiled the code and confirmed that a cudaStreamSynchronize call accounts for about half of the time.
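For context, this is roughly how I disabled the sync when measuring (a minimal sketch with placeholder names such as model, criterion, loader, and optimizer; model is the DistributedDataParallel wrapper):

```python
# Minimal sketch (placeholder names): wrapping forward/backward in DDP's
# no_sync() context skips the gradient all_reduce for that step.
for inputs, targets in loader:
    optimizer.zero_grad(set_to_none=True)
    with model.no_sync():                        # no gradient all_reduce
        loss = criterion(model(inputs), targets)
        loss.backward()
    optimizer.step()                             # gradients stay local to each rank
```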
I'd like to ask whether there is any method that could reduce this heavy communication cost. Can anyone help me? Thank you!
cudaStreamSynchronize is called because I use torch.amp, whose GradScaler checks whether the gradients contain NaN/Inf values and therefore needs a grad.item()-style device-to-host read. I don't know the inner implementation of DDP, but in the profile I can see that nccl_all_reduce overlaps with cudaStreamSynchronize. So I think cudaStreamSynchronize takes so long because it is waiting for the slow communication between the GPUs to finish.
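To make it concrete, my training step looks roughly like the sketch below (all names are placeholders, the real model and data are simplified). As far as I can tell, the wait shows up around scaler.step(), which has to read the inf/NaN flag back to the host and therefore blocks until the backward work, including the overlapped all_reduce, has finished:

```python
# Minimal sketch of the AMP + DDP step I am describing (placeholder names).
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()  # DDP launches the gradient all_reduce here
    scaler.step(optimizer)         # reads the inf/NaN flag -> blocks on the stream
    scaler.update()
```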
To check the all_reduce performance of your system for different message sizes, you could use nccl-tests as a reference. If the PyTorch communication calls take significantly longer than those results, NCCL might be waiting on other work and you would need to dig deeper into the profile.
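As a rough cross-check from the PyTorch side, you could also time a bare all_reduce on a tensor of about your gradient size and compare it with the nccl-tests numbers. The sketch below assumes a torchrun launch and ~1M fp32 gradient elements (about 4 MB); adjust the size and iteration counts to your setup:

```python
# Rough benchmark sketch, assuming a `torchrun --nproc_per_node=<gpus> bench.py` launch.
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# ~1M fp32 elements, roughly the size of the flattened gradients.
x = torch.randn(1_000_000, device="cuda")

# Warm up NCCL before timing.
for _ in range(5):
    dist.all_reduce(x)
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(20):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / 20

if dist.get_rank() == 0:
    print(f"avg all_reduce time for ~4 MB: {elapsed * 1e3:.3f} ms")

dist.destroy_process_group()
```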