While training with the latest version of DDP on 4 GPUs, I see the log “Reducer buckets have been rebuilt in this iteration” 3 times at the beginning of training, although the docstring states that the buckets should only be rebuilt once. My guess is that the 3 logs correspond to the 3 additional GPUs on which buckets have to be built, but that doesn’t seem like a trustworthy explanation. Thank you!
Hi, Shen, thank you for replying. I’ve actually found out that the extra logs were caused by the architecture of my model. I have 3 models and 3 optimizers in my framework, so it makes sense now why DDP triggered the bucket allocation 3 times.
@space1panda Hi! I have 3 models and 3 optimizers in my framework as well. I actually wrapped my models in three separate torch.nn.parallel.DistributedDataParallel instances instead of ONE (a sketch of that kind of setup is below), and the log ‘Reducer buckets have been rebuilt in this iteration.’ was also repeated three times. I only use one GPU to train. So is it okay to use multiple DistributedDataParallel models in PyTorch DDP?
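Not the original poster’s code, but a minimal sketch of the kind of setup described above, assuming a single-process, world_size=1 run on CPU with the gloo backend; the module names and sizes are hypothetical. Each DDP wrapper owns its own Reducer, which is consistent with seeing the “Reducer buckets have been rebuilt in this iteration.” message once per wrapped model.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Single-process "distributed" setup so DDP can initialize (gloo also works on CPU).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

    # Three independent models, each wrapped in its own DDP instance
    # (hypothetical modules standing in for the real framework).
    encoder = DDP(nn.Linear(16, 32))
    decoder = DDP(nn.Linear(32, 16))
    discriminator = DDP(nn.Linear(16, 1))

    # One optimizer per model, matching the 3-model / 3-optimizer framework.
    opts = [torch.optim.SGD(m.parameters(), lr=0.01)
            for m in (encoder, decoder, discriminator)]

    x = torch.randn(8, 16)
    for _ in range(2):  # buckets are rebuilt after the first backward pass
        out = discriminator(decoder(encoder(x)))
        loss = out.mean()
        for opt in opts:
            opt.zero_grad()
        loss.backward()  # each DDP instance reduces gradients for its own parameters
        for opt in opts:
            opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With a layout like this, three bucket-rebuild messages are expected, since each DistributedDataParallel instance builds and rebuilds its own gradient buckets independently of the others.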