In pytorch, they provide the capability of async allreduce. And How they can control this.
For example, fastest GPU0 can ready gradient of bucket size, and launch the allreduce. But other GPU are not ready for it. And how GPU0 can wait until other GPU can launch the allreduce.
As I know, In multi-GPU system, All-GPU launches the allreduce kernel simultaneously, But they can do asynchronously execute the allreduce. Or it control the nccl…? or pytorch.