Will rank 0 of torch.nn.parallel.DistributedDataParallel gather gradients from the other GPUs?

When I use torch.nn.parallel.DistributedDataParallel for training, I found that the processes launched for the other GPUs also take up some memory on GPU 0, i.e. worker 0. This uses up so much of GPU 0's memory that I can't use a bigger batch size. How can I solve this? Is it because the other processes send their gradients to GPU 0 by default to compute the average?

By the way, in the screenshot the training process on GPU 0 was killed because it crashed with an out-of-memory error.
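For context, a common cause of this is each process creating a CUDA context (or loading a checkpoint) on GPU 0 rather than rank 0 gathering gradients. Below is a minimal sketch of pinning each process to its own GPU before any CUDA work, assuming a torchrun-style launch that sets LOCAL_RANK; the placeholder model and the commented-out "ckpt.pt" path are illustrative, not from the original post.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each process.
    local_rank = int(os.environ["LOCAL_RANK"])

    # Pin this process to its own GPU *before* any CUDA calls,
    # so no stray CUDA context is created on GPU 0.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    device = torch.device("cuda", local_rank)
    model = nn.Linear(1024, 1024).to(device)  # placeholder model

    # If resuming, map the checkpoint onto this rank's GPU, not GPU 0:
    # state = torch.load("ckpt.pt", map_location=device)
    # model.load_state_dict(state)

    ddp_model = DDP(model, device_ids=[local_rank], output_device=local_rank)

    # ... training loop using ddp_model ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With this setup each process should only allocate memory on its own GPU; gradient averaging in DDP is done with all-reduce across ranks rather than by gathering everything on rank 0.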

Hi, I'm running into the same problem. Have you solved it? Thanks.