Hello.
I've read the code in data_parallel.py
and the implementation of optimizer.step
, but I can't find the code that accumulates the gradients from multiple GPUs onto a single one…
Can anyone give a hint about where this is implemented?
Thanks!
It is the backward of Broadcast that computes the gradient: https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/_functions.py#L8-L30. Gradients are then automatically accumulated into the .grad attribute by the autograd engine.
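For anyone else reading, here is a minimal sketch of that behaviour (assuming a machine with at least two CUDA devices):

```python
import torch
import torch.nn as nn

# DataParallel replicates the module onto each device via Broadcast.
# During backward, Broadcast's backward reduces the per-replica
# gradients back onto the source device, and the autograd engine
# accumulates them into the original parameters' .grad.
model = nn.Linear(10, 1).cuda(0)
dp_model = nn.DataParallel(model, device_ids=[0, 1])

x = torch.randn(8, 10, device='cuda:0')   # batch gets scattered across GPUs
loss = dp_model(x).sum()
loss.backward()

# The summed gradient ends up on the original parameters on cuda:0.
print(model.weight.grad.device)  # cuda:0
```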
Thanks. Does it mean that copying a variable from one GPU to another is still recorded in the computation graph? Then, during the backward pass, the gradients on the different GPUs will flow back.
Yes, both for DataParallel and for a plain copy like x.cuda(1)
or x.to(device)
. The backward passes for these two scenarios are implemented differently, but they both work.
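For example, a small sketch of the plain-copy case (assuming two CUDA devices):

```python
import torch

# A cross-device copy is an autograd op, so it is recorded in the graph
# and gradients flow back from cuda:1 to the leaf tensor on cuda:0.
x = torch.randn(4, device='cuda:0', requires_grad=True)
y = x.to('cuda:1')        # tracked like any other operation
loss = (y * 2).sum()      # computed on cuda:1
loss.backward()

print(x.grad.device)      # cuda:0 -- the gradient is copied back
print(x.grad)             # tensor([2., 2., 2., 2.], device='cuda:0')
```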
Thanks. Knowing this, maybe I can do parallel training with batches that have a different configuration on each GPU…
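If it helps, here is a hypothetical sketch of that idea (not tested, assumes two GPUs, and relies on the differentiable broadcast/copy discussed above):

```python
import torch
import torch.nn as nn

# Keep the parameters on cuda:0, run a differently sized/configured
# sub-batch on each device, and let the gradients from both devices
# accumulate onto the original parameters through the broadcast.
model = nn.Linear(10, 1).cuda(0)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

big_batch = torch.randn(16, 10, device='cuda:0')    # "config A" on GPU 0
small_batch = torch.randn(4, 10, device='cuda:1')   # "config B" on GPU 1

# Differentiable replication of the module onto both devices.
replicas = nn.parallel.replicate(model, [0, 1])

loss0 = replicas[0](big_batch).sum()
loss1 = replicas[1](small_batch).sum()

opt.zero_grad()
(loss0 + loss1.to('cuda:0')).backward()   # grads sum onto model's params
opt.step()
```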