Which code does the work of accumulating gradients when using DataParallel?

I’ve read the code in data_parallel.py and the implementation of optimizer.step, but I can’t find the code that accumulates the gradients from multiple GPUs onto a single one…
Can anyone give a hint about the implementation?
Thanks!

It is the backward of Broadcast that does this: it sums the gradients coming from the replicas back onto the source device. See https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/_functions.py#L8-L30. The autograd engine then automatically accumulates the result into the .grad attribute of the parameters.
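To make the accumulation concrete, here is a minimal sketch that runs on CPU. It is not the DataParallel code path: clone() is only a stand-in (my assumption) for the differentiable cross-GPU copy that Broadcast performs, and the names replica0, replica1, x0 and x1 are made up for illustration. The point is just that when one parameter feeds several differentiable copies, autograd sums their gradients into .grad.

```python
import torch

# Shared parameter, standing in for a module weight on the source device.
w = torch.tensor([1.0, 2.0], requires_grad=True)

# Two differentiable copies, standing in for the per-GPU replicas.
# clone() is an assumption so the sketch runs without GPUs; the real
# replicas are produced by Broadcast in torch/nn/parallel/_functions.py.
replica0 = w.clone()
replica1 = w.clone()

# Each "replica" processes a different slice of the batch.
x0 = torch.tensor([3.0, 4.0])
x1 = torch.tensor([5.0, 6.0])

loss = (replica0 * x0).sum() + (replica1 * x1).sum()
loss.backward()

# The gradients flowing back through both copies are summed into w.grad,
# so w.grad equals x0 + x1 here.
print(w.grad)
```

Nothing in optimizer.step is involved: by the time the optimizer runs, w.grad already holds the summed gradient.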

Thanks. Does this mean that copying a variable from one GPU to another is still recorded in the computation graph? Then, during backward, the gradients on the different GPUs will flow back to the original.

Yes, both for DataParallel and for a plain copy such as x.cuda(1) or x.to(device). The backward passes for these two scenarios are implemented differently, but both work.
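A small sketch of the plain-copy case, for anyone who wants to check it. The device choice is an assumption: it uses cuda:0 when available and falls back to the CPU otherwise, in which case x.to(device) is a no-op but the gradient still reaches x.

```python
import torch

# Pick a GPU if one is available; otherwise stay on the CPU (assumption
# made so the sketch runs anywhere).
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.ones(3, requires_grad=True)  # leaf tensor created on the CPU
y = x.to(device)                       # the copy stays connected to x in the graph
loss = (y * 2).sum()
loss.backward()

# The gradient flows back across the device copy and lands on x,
# on the device where x lives.
print(x.grad, x.grad.device)
```

The same pattern applies to x.cuda(1) on a machine with at least two GPUs.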

Thanks. Given this, maybe I can do parallel training where each GPU gets a batch with a different configuration…