Reducer Buckets message with 4 GPUs

What does “Reducer buckets have been rebuilt in this iteration” mean?
I got this at the start of training using 4 GPUs.

This refers to some of the internals of PyTorch DDP. In each backward pass, DDP must allreduce gradients across all the processes (one per GPU, 4 in this case) so that they stay in sync and we reap the benefit of using multiple GPUs. These gradients are gathered into buckets (25 MB each by default), and the allreduce for a bucket is initiated as soon as that bucket is full. Once during the course of training, the buckets are rebuilt so that they match the order in which the parameters actually receive their gradients in the backward pass. The log message you saw simply indicates that this bucket allocation/rebuilding process has taken place, in this case at the start of training.
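For reference, the 25 MB bucket size can be changed when the model is wrapped, via DDP's bucket_cap_mb argument. A minimal sketch, assuming a single-node job launched with torchrun (so LOCAL_RANK is set) and an NCCL backend:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the process group; torchrun provides RANK/WORLD_SIZE/LOCAL_RANK.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)

# bucket_cap_mb controls the gradient bucket size (25 MB is the default);
# the allreduce for a bucket starts once all of its gradients are ready.
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)

Larger buckets mean fewer allreduce calls, but less overlap between communication and the remaining backward computation, so the default is usually a reasonable starting point.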

Do we need to release this when we finish training?

@PistonY The bucket lifecycle is handled for you internally.

Only one GPU is used in my code, but “Reducer buckets have been rebuilt in this iteration” is still printed at the start of training. Is this normal?

Hi, is there any way to silence this print? Thanks.

Hi @H-Huang ! I am looking to re-implement the gradient aggregation logic for a personal research project. Could you point me to the file this is implemented in?

I get this before every batch. E.g.

[2023-09-26 00:06:21,660][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:21,660][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
Epoch: [0][0/98]        Time 2.962 (2.962)      Data 0.632 (0.632)      Loss 3.1528 (3.1528)    Adv Loss (Mean) nan     Acc@1 0.088     Acc@5 0.541
[2023-09-26 00:06:22,223][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,223][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,271][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,271][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,433][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,433][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,539][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,539][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,648][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,648][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,756][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,756][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
Epoch: [0][5/98]        Time 0.108 (0.596)      Data 0.001 (0.109)      Loss 7.5593 (5.2144)    Adv Loss (Mean) nan     Acc@1 0.099     Acc@5 0.521
[2023-09-26 00:06:22,868][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,868][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,973][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,973][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:23,078][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:23,078][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:23,185][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:23,185][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:23,291][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:23,291][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.

Can this cause performance issues?
