While training with the latest version of DDP on 4 GPUs, I see the log “Reducer buckets have been rebuilt in this iteration” 3 times at the beginning of training, although the docstring states that the buckets should only be rebuilt once. My guess is that the 3 logs correspond to the 3 additional GPUs on which buckets have to be built, but that doesn’t seem like a trustworthy explanation. Thank you!
Hi, Shen, thank you for replying. I’ve actually found out that the extra logs were caused by the architecture of my model. I have 3 models and 3 optimizers in my framework, so it makes sense now why DDP triggered the bucket allocation 3 times.
@space1panda Hi! I have 3 models and 3 optimizers in my framework as well. I actually wrapped my models in three separate torch.nn.parallel.DistributedDataParallel instances instead of ONE (a sketch of that kind of setup is below), and the log ‘Reducer buckets have been rebuilt in this iteration.’ was also repeated three times. I only use one GPU to train. So is it okay to use multiple DistributedDataParallel models in PyTorch DDP?
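Not the original poster’s code, but a minimal sketch of the kind of setup described above, assuming a single-process, world_size=1 run on CPU with the gloo backend; the module names and sizes are hypothetical. Each DDP wrapper owns its own Reducer, which is consistent with seeing the “Reducer buckets have been rebuilt in this iteration.” message once per wrapped model.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Single-process "distributed" setup so DDP can initialize (gloo also works on CPU).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

    # Three independent models, each wrapped in its own DDP instance
    # (hypothetical modules standing in for the real framework).
    encoder = DDP(nn.Linear(16, 32))
    decoder = DDP(nn.Linear(32, 16))
    discriminator = DDP(nn.Linear(16, 1))

    # One optimizer per model, matching the 3-model / 3-optimizer framework.
    opts = [torch.optim.SGD(m.parameters(), lr=0.01)
            for m in (encoder, decoder, discriminator)]

    x = torch.randn(8, 16)
    for _ in range(2):  # buckets are rebuilt after the first backward pass
        out = discriminator(decoder(encoder(x)))
        loss = out.mean()
        for opt in opts:
            opt.zero_grad()
        loss.backward()  # each DDP instance reduces gradients for its own parameters
        for opt in opts:
            opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With a layout like this, three bucket-rebuild messages are expected, since each DistributedDataParallel instance builds and rebuilds its own gradient buckets independently of the others.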