Reducer Buckets message with 4 GPUs

What does “Reducer buckets have been rebuilt in this iteration” mean?
I got this at the start of training using 4 GPUs.

This refers to some of the internals of PyTorch DDP. In each backward pass, DDP must allreduce gradients across all the processes (one per GPU, 4 in this case) so that they stay in sync and we reap the benefit of using multiple GPUs. These gradients are gathered into buckets (25 MB each by default), and the allreduce for a bucket is initiated as soon as that bucket is full. Once during the course of training, the buckets are rebuilt so that they match the order in which the parameters actually receive their gradients in the backward pass. The log message you saw simply indicates that this bucket allocation/rebuilding process has taken place, in this case at the start of training.
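For reference, the 25 MB bucket size can be changed when the model is wrapped, via DDP's bucket_cap_mb argument. A minimal sketch, assuming a single-node job launched with torchrun (so LOCAL_RANK is set) and an NCCL backend:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the process group; torchrun provides RANK/WORLD_SIZE/LOCAL_RANK.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)

# bucket_cap_mb controls the gradient bucket size (25 MB is the default);
# the allreduce for a bucket starts once all of its gradients are ready.
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)

Larger buckets mean fewer allreduce calls, but less overlap between communication and the remaining backward computation, so the default is usually a reasonable starting point.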

Do we need to release this when we finish training?

@PistonY The bucket lifecycle is handled for you internally.

Only one GPU is used in my code, but “Reducer buckets have been rebuilt in this iteration” is still printed at the start of training. Is this normal?

Hi, is there any way to silence this print? Thanks.

Hi @H-Huang ! I am looking to re-implement the gradient aggregation logic for a personal research project. Could you point me to the file this is implemented in?

I get this before every batch. E.g.

[2023-09-26 00:06:21,660][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:21,660][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
Epoch: [0][0/98]        Time 2.962 (2.962)      Data 0.632 (0.632)      Loss 3.1528 (3.1528)    Adv Loss (Mean) nan     Acc@1 0.088     Acc@5 0.541
[2023-09-26 00:06:22,223][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,223][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,271][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,271][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,433][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,433][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,539][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,539][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,648][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,648][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,756][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,756][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
Epoch: [0][5/98]        Time 0.108 (0.596)      Data 0.001 (0.109)      Loss 7.5593 (5.2144)    Adv Loss (Mean) nan     Acc@1 0.099     Acc@5 0.521
[2023-09-26 00:06:22,868][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,868][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,973][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:22,973][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:23,078][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:23,078][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:23,185][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:23,185][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:23,291][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.
[2023-09-26 00:06:23,291][torch.nn.parallel.distributed][INFO] - Reducer buckets have been rebuilt in this iteration.

Can this cause performance issues?
