What are the groups in DistributedDataParallel and why do we need them?

dang-qi · April 25, 2021, 4:21pm

I checked the code in detectron2 and found that it builds a new process group for each machine. I haven’t learned much about distributed systems, and I am curious why they do that. I tried searching for an explanation but couldn’t find one. Can anyone explain this?

pritamdamania87 · April 26, 2021, 6:30pm

I would recommend reading the following documentation: "Distributed communication package - torch.distributed" (PyTorch 1.8.1 documentation).
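
For a concrete picture of what "a new group for each machine" means, here is a minimal sketch of the pattern. It is not detectron2's actual code. In `torch.distributed`, a group is a subset of the processes in the job, and collective operations issued on a group involve only its members. By default everything runs on one group containing all processes. The helper names `ranks_per_machine` and `build_local_group` are hypothetical, and the sketch assumes that global ranks are assigned contiguously per machine (ranks 0..3 on machine 0, 4..7 on machine 1, and so on), which depends on how the job is launched.

```python
from typing import List


def ranks_per_machine(world_size: int, gpus_per_machine: int) -> List[List[int]]:
    """Partition global ranks 0..world_size-1 into one list per machine.

    Assumes ranks are laid out contiguously per machine (an assumption of
    this sketch, not something torch.distributed guarantees).
    """
    if gpus_per_machine <= 0 or world_size <= 0 or world_size % gpus_per_machine != 0:
        raise ValueError("world_size must be a positive multiple of gpus_per_machine")
    num_machines = world_size // gpus_per_machine
    return [
        list(range(m * gpus_per_machine, (m + 1) * gpus_per_machine))
        for m in range(num_machines)
    ]


def build_local_group(world_size: int, gpus_per_machine: int, machine_rank: int):
    """Create one process group per machine and return this machine's group.

    Hypothetical helper: call it in every process, after
    dist.init_process_group(...) has set up the default group.
    """
    # Imported lazily so that ranks_per_machine works without torch installed.
    import torch.distributed as dist

    local_group = None
    for m, ranks in enumerate(ranks_per_machine(world_size, gpus_per_machine)):
        # new_group must be entered by every process in the job for every
        # group, including groups the calling process does not belong to.
        group = dist.new_group(ranks)
        if m == machine_rank:
            local_group = group
    return local_group


if __name__ == "__main__":
    # Two machines with four GPUs each.
    print(ranks_per_machine(8, 4))  # prints [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With a per-machine group in hand, a process can ask for its rank within its own machine via `dist.get_rank(group=local_group)`, or run a collective such as `dist.barrier(group=local_group)` that only waits for the processes sharing that machine. As far as I can tell, that kind of machine-local bookkeeping is what the per-machine groups are for. Gradient synchronization in DistributedDataParallel still uses the default group of all processes unless a different one is passed through its `process_group` argument.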