The documentation for ‘DataParallel’ says “During the backwards pass, gradients from each replica are summed into the original module”. I am wondering how this is implemented, i.e. how the gradients from the different devices are accumulated. I have read the code but am still confused.
The function `replicate` is part of the computational graph: it broadcasts the parameters to each device through an autograd function. The backward of that broadcast performs a reduce-add, so the gradients from the replicas are summed there, back onto the original module's parameters.
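A minimal, self-contained sketch of the idea (not the actual PyTorch internals — the function names here are illustrative): the forward of the broadcast copies one parameter to every replica, and its backward sums the per-replica gradients elementwise into a single gradient for the original parameter.

```python
# Conceptual sketch of how replicate()'s backward accumulates gradients.
# In real DataParallel this happens inside an autograd Function during
# the backward pass; here we model it with two plain functions.

def broadcast_forward(param, num_replicas):
    # Forward: hand each replica a copy of the original parameter.
    return [list(param) for _ in range(num_replicas)]

def broadcast_backward(replica_grads):
    # Backward: sum the gradients from all replicas elementwise,
    # producing one gradient for the original parameter.
    total = [0] * len(replica_grads[0])
    for grad in replica_grads:
        for i, g in enumerate(grad):
            total[i] += g
    return total

# Suppose three replicas each computed a gradient on their shard of the batch:
grads = [[1, 2], [3, 4], [5, 6]]
# The backward of the broadcast sums them into the original module:
print(broadcast_backward(grads))  # → [9, 12]
```

This is why the docs say the gradients are “summed into the original module”: the summation is just the gradient of the copy/broadcast operation itself, so autograd performs it automatically.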