Reducer Buckets message with 4 GPUs

What does “Reducer buckets have been rebuilt in this iteration” mean?
I got this at the start of training using 4 GPUs.


This refers to some of the internals of PyTorch DDP. In each backward pass, DDP must allreduce gradients across all the processes (4 GPUs in this case) so that they stay in sync and we reap the benefit of using multiple GPUs. These gradients are gathered into buckets (25 MB by default), and the allreduce for a bucket is initiated once it is full. Once during the course of training, these buckets are rebuilt according to the order in which the tensors actually receive gradients in the backward pass. The log message you saw simply indicates that this bucket allocation/rebuilding process has taken place, in this case at the start of training.
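For reference, the bucket size can be tuned with the `bucket_cap_mb` argument when wrapping the model in DDP. A minimal sketch, assuming a single-node job launched with `torchrun` and NCCL as the backend (the helper name and launch setup here are just for illustration):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp_model(model: torch.nn.Module) -> DDP:
    # torchrun sets LOCAL_RANK for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = model.cuda(local_rank)
    # Gradients are gathered into ~25 MB buckets by default; bucket_cap_mb
    # lets you trade allreduce frequency against communication/compute overlap.
    ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)
    return ddp_model
```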


Do we need to release these buckets when we finish training?

@PistonY The bucket lifecycle is handled for you internally.

Only one GPU is used in my code, but “Reducer buckets have been rebuilt in this iteration” is still printed at the start of training. Is this normal?