I’ve read the code in
data_parallel.py and the implementation of
optimizer.step, but I can’t find the code that accumulates the gradients from multiple GPUs onto a single one…
Can anyone give a hint about where this is implemented?
It is the backward of Broadcast that accumulates the gradients: https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/_functions.py#L8-L30. Gradients are automatically accumulated into the .grad attribute by the autograd engine.
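To see the accumulation behaviour concretely, here is a minimal CPU-only sketch (no DataParallel or Broadcast involved, just an assumption-free illustration of the autograd engine): when a leaf tensor feeds several branches, each backward call sums into .grad rather than overwriting it, which is the same mechanism the Broadcast backward relies on to combine per-GPU gradients.

```python
import torch

# A leaf tensor shared by two branches, standing in for per-GPU replicas
x = torch.ones(2, requires_grad=True)

# Two independent computations on the same leaf
loss1 = (2 * x).sum()
loss2 = (3 * x).sum()

# Each backward call accumulates into x.grad instead of replacing it
loss1.backward()
loss2.backward()

print(x.grad)  # tensor([5., 5.]) — 2 + 3 accumulated per element
```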
Thanks. Does that mean that copying a variable from one GPU to another is still recorded in the computation graph? Then, during backward, the gradients on the different GPUs flow back?
Yes, both for DataParallel and for a plain copy like
x.to(device). The backward passes for these two scenarios are implemented differently, but they both work.
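A small sketch of the second case: a copy created with .to() is itself an autograd-tracked op, so gradients flow back through it to the original tensor. As an assumption for illustration, a dtype copy stands in for the cross-GPU copy here so the example runs on a CPU-only machine; on multi-GPU hardware the device copy x.to('cuda:1') is tracked the same way.

```python
import torch

x = torch.ones(3, requires_grad=True)

# .to() returns a differentiable copy; a dtype change is used so this runs
# without a GPU, but a device copy like x.to('cuda:1') behaves the same way.
y = x.to(torch.float64)

# The copy has a grad_fn, i.e. it was recorded in the computation graph
assert y.grad_fn is not None

y.sum().backward()
print(x.grad)  # tensor([1., 1., 1.]) — the gradient flowed back through the copy
```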
Thanks. Given these facts, maybe I can do parallel training with batches that have a different configuration on each GPU…