I am trying to understand distributed autograd so I can use it correctly for my use case. Specifically, I’m confused about this passage from the documentation:
"Multiple nodes running distributed backward passes might accumulate gradients on the same tensor and as a result the .grad field of the tensor would have gradients from a variety of distributed backward passes before we have the opportunity to run the optimizer. This is similar to calling torch.autograd.backward() multiple times locally. In order to provide a way of separating out the gradients for each backward pass, the gradients are accumulated in the torch.distributed.autograd.context for each backward pass."
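For context, here is my current mental model of how the context is supposed to be used, as a minimal single-process sketch (the worker name and the toy tensor are placeholders I made up, not from any official example):

```python
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

# Distributed autograd requires the RPC framework; a single-process setup
# is enough to illustrate where the gradients end up.
rpc.init_rpc("worker0", rank=0, world_size=1)

t = torch.rand(3, 3, requires_grad=True)

with dist_autograd.context() as context_id:
    loss = (t * 2).sum()
    # Gradients for this backward pass are accumulated inside this context,
    # keyed by context_id, rather than into t.grad.
    dist_autograd.backward(context_id, [loss])
    grads = dist_autograd.get_gradients(context_id)
    print(grads[t])   # gradient lives in the context
    print(t.grad)     # stays None; .grad is untouched

rpc.shutdown()
```

That part I can follow: each backward pass gets its own context, so its gradients stay separate. What I don't get is the scenario the docs are guarding against: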
Why would multiple backward passes be executing at the same time? Are they for different batches? All the examples seem to show an optimizer update at the end of processing each batch.