Traversing the recorded graph manually

Hi,

The question about `apply` is answered here: How does pytorch implements backward process?
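
Since the thread is about traversing the recorded graph manually, here is a minimal sketch of how you can walk it yourself using the public `grad_fn` and `next_functions` attributes (the recursive `walk` helper is just for illustration):

```python
import torch

x = torch.ones(2, requires_grad=True)
y = (x * 2).sum()

# Walk the recorded graph starting from the output's grad_fn,
# printing each backward node. next_functions holds (node, input_idx)
# pairs; leaf Tensors end in an AccumulateGrad node.
def walk(fn, depth=0):
    if fn is None:  # inputs that don't require grad have no node
        return
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1)

walk(y.grad_fn)
# SumBackward0
#   MulBackward0
#     AccumulateGrad
```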

I am not very familiar with DistributedDataParallel unfortunately, but it most likely just uses the autograd logic that makes sure that multiple uses of a single Tensor accumulate all the gradients before the next step is performed. Basically, the op before the Reducer knows that 3 copies exist and so will wait for 3 completions before executing.
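
You can see this accumulation behavior in plain autograd, without DDP. In this small example, `x` feeds three separate ops, so during backward its `AccumulateGrad` node receives three gradient contributions and sums them before `x.grad` is final:

```python
import torch

x = torch.ones(3, requires_grad=True)

# x is used by three different ops ("3 copies" of its gradient path)
y1 = x * 2
y2 = x * 3
y3 = x * 4

out = (y1 + y2 + y3).sum()
out.backward()

# All three contributions are accumulated: 2 + 3 + 4 = 9 per element
print(x.grad)  # tensor([9., 9., 9.])
```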