v1.7.0: “Reducer buckets have been rebuilt in this iteration” logged multiple times


While training with DDP on 4 GPUs using the latest version, I see the log “Reducer buckets have been rebuilt in this iteration” 3 times at the beginning of training, while the docstring states that the buckets should be rebuilt only once. My first guess was that I see 3 logs because of the 3 additional GPUs the buckets are built on, but that doesn’t seem like a trustworthy explanation. Thank you!


Hey @space1panda, does the same process print that log 3 times or does the log come from different processes?

Hi Shen, thank you for replying. I’ve actually found out that the extra logs were caused by the architecture of my model: I have 3 models and 3 optimizers in my framework, so it makes sense now why DDP allocated buckets 3 times.


@space1panda Hi! I have 3 models and 3 optimizers in my framework as well. I actually wrapped each model in torch.nn.parallel.DistributedDataParallel as follows, instead of using ONE wrapper, and the log ‘Reducer buckets have been rebuilt in this iteration.’ was also repeated three times. I only use one GPU to train. So is it okay to use multiple DistributedDataParallel models in PyTorch DDP?

from torch.nn.parallel import DistributedDataParallel as ddp

model1 = ddp(model1, device_ids=[local_rank], output_device=local_rank)
model2 = ddp(model2, device_ids=[local_rank], output_device=local_rank)
model3 = ddp(model3, device_ids=[local_rank], output_device=local_rank)
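Each DistributedDataParallel wrapper maintains its own Reducer, so each wrapper rebuilds its own buckets after the first iteration, producing one log line per wrapper. A minimal, self-contained sketch of the same three-wrapper setup (single process, CPU, gloo backend, and tiny Linear models are all assumptions for illustration, not from the thread above):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "group" on CPU with the gloo backend, just for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Three independent DDP wrappers, as in the snippet above (CPU modules,
# so no device_ids/output_device are needed here).
model1 = DDP(torch.nn.Linear(4, 4))
model2 = DDP(torch.nn.Linear(4, 4))
model3 = DDP(torch.nn.Linear(4, 4))

# Each wrapper owns its own Reducer; after the first forward/backward pass
# each one rebuilds its own buckets, so the log line appears three times.
x = torch.randn(2, 4)
for m in (model1, model2, model3):
    m(x).sum().backward()

dist.destroy_process_group()
```

Running this once per process is enough to see one rebuild log per wrapper; the gradients are synchronized independently for each wrapped model.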