Hi, I have a question about how DDP computes the gradient average. I read in the DDP backward pass description that when all buckets are ready, the local Reducer will block waiting for all allreduce operations to finish. But what happens when GPUs run at different speeds? For example, if two GPUs (e.g. cuda:0 and cuda:1) run 1.5x faster than the other GPUs (same code and processing), cuda:0 and cuda:1 will produce more gradients. Will they keep these gradients in the bucket and wait for the other GPUs to get ready, or will they just abandon them and reduce only the gradients that are ready on all GPUs?
The faster GPU processes will wait for other GPUs to finish their backward computation.
By default, DDP synchronizes gradients and parameters and then performs the next forward computation.
The differing speed case is quite common in practice.
You can take a look at the forward function of DDP.
Just to add some more insight to this, we have a bucket_cap_mb argument in the DDP constructor. This defines the size of a gradient bucket in megabytes. During the backward pass, each rank fills the bucket with gradients and then kicks off the allreduce collective. Faster ranks will kick off the allreduce collective earlier than the slower ranks, so they will just block until the slower ranks kick off the collective. No gradients are abandoned in this process. You can tune the bucket_cap_mb as desired, and this will trigger allreduce more frequently for smaller buckets and less frequently for larger buckets.
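To make the bucket-size trade-off concrete, here is a rough pure-Python sketch of size-capped bucketing. It is only an illustration: the function name assign_buckets and the greedy rule are my own simplification, and DDP's real assignment also considers things like dtype, device and a smaller first bucket. It only shows why a smaller bucket_cap_mb means more (and earlier) allreduce launches per backward pass.

```python
def assign_buckets(param_bytes, bucket_cap_mb=25):
    """Group parameter indices into buckets of roughly bucket_cap_mb each.

    Simplified, hypothetical model of gradient bucketing, not DDP's actual code.
    Parameters are visited in reverse order, since gradients for the last
    layers tend to become ready first during backward.
    """
    cap = bucket_cap_mb * 1024 * 1024
    buckets, current, current_size = [], [], 0
    for idx in reversed(range(len(param_bytes))):
        current.append(idx)
        current_size += param_bytes[idx]
        # Close the bucket once it reaches the cap; in this model,
        # each closed bucket corresponds to one allreduce launch.
        if current_size >= cap:
            buckets.append(current)
            current, current_size = [], 0
    if current:
        buckets.append(current)
    return buckets


# Six parameters of 10 MiB each, bucketed under three different caps.
sizes = [10 * 1024 * 1024] * 6
for cap_mb in (5, 25, 100):
    print(cap_mb, "MB cap ->", len(assign_buckets(sizes, cap_mb)), "allreduce calls")
```

In real code the knob is just the constructor argument, e.g. DDP(model, bucket_cap_mb=25). Whatever the bucket count, every rank has to join every allreduce, so a faster rank still waits for the slowest one on each bucket.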
If the performance difference is too great, you can explore syncing gradients less frequently (every n batches instead of every batch) using the model.no_sync() context manager, or using multiple process groups (via the new_group API). If you find DDP training getting stuck because these blocked collectives hang for too long, you may look into using torchelastic together with some mechanism to time out hanging collectives (such as NCCL_ASYNC_ERROR_HANDLING or NCCL_BLOCKING_WAIT; see the docs).
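As a minimal sketch of the "sync every n batches" idea, the loop below wraps the non-sync steps in no_sync(). The function name train_with_accumulation and the accum_steps parameter are made up for illustration, and I'm assuming ddp_model is a DistributedDataParallel instance and batches yields (inputs, targets) pairs. The forward pass is kept inside the context manager on purpose, since the DDP docs say gradients are still synchronized otherwise.

```python
from contextlib import nullcontext


def train_with_accumulation(ddp_model, optimizer, loss_fn, batches, accum_steps=4):
    """Run backward on every batch, but allreduce only every accum_steps batches.

    Returns the number of synchronized optimizer steps taken.
    """
    optimizer.zero_grad()
    sync_count = 0
    for step, (inputs, targets) in enumerate(batches, start=1):
        is_sync_step = step % accum_steps == 0
        # Under no_sync() DDP skips the allreduce and gradients accumulate
        # locally in .grad; the next backward outside it reduces the sum.
        ctx = nullcontext() if is_sync_step else ddp_model.no_sync()
        with ctx:
            # Scale so the accumulated gradient is an average over the window.
            loss = loss_fn(ddp_model(inputs), targets) / accum_steps
            loss.backward()
        if is_sync_step:
            optimizer.step()
            optimizer.zero_grad()
            sync_count += 1
    # Note: batches left over after the last full window are never applied here.
    return sync_count
```

With a real setup this would be called as something like train_with_accumulation(ddp_model, optimizer, torch.nn.functional.mse_loss, loader). It reduces how often fast ranks block on slow ones, but it does not remove the wait: at each sync step all ranks still have to meet in the allreduce.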