Question about the need for distributed autograd context


I am trying to understand distributed autograd so I can use it correctly for my use case. Specifically, I’m confused about this passage from the documentation:

Multiple nodes running distributed backward passes might accumulate gradients on the same tensor and as a result the .grad field of the tensor would have gradients from a variety of distributed backward passes before we have the opportunity to run the optimizer. This is similar to calling torch.autograd.backward() multiple times locally. In order to provide a way of separating out the gradients for each backward pass, the gradients are accumulated in the torch.distributed.autograd.context for each backward pass.

Why would multiple backward passes be executing at the same time? Are they for different batches? All the examples seem to show an optimizer update at the end of processing each batch.

Hey @rahul003

This is for the use case where you have multiple independent trainers accessing the same parameter server (PS), and each trainer runs its own forward-backward-optimize loop in every iteration. With this setup, the parameters on the PS need to keep the gradients from different trainers separate, because the trainers run their optimizer steps independently. Hence, we cannot accumulate grads directly into the param.grad field.
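To make the problem concrete, here is a toy sketch in plain Python. It is not how PyTorch implements this, and `ToyParameterServer`, `shared_grad`, and `context_grads` are hypothetical names made up for illustration. It only shows why a single shared gradient field mixes up trainers, while a per-context store keeps them apart:

```python
from collections import defaultdict


class ToyParameterServer:
    """Toy stand-in for a parameter server holding one scalar parameter.

    Hypothetical illustration only; this is not PyTorch's implementation.
    """

    def __init__(self, weight):
        self.weight = weight
        # Analogue of param.grad: one field shared by every backward pass.
        self.shared_grad = 0.0
        # Analogue of per-context storage: one gradient slot per context id.
        self.context_grads = defaultdict(float)

    def backward(self, context_id, grad):
        # Every backward pass accumulates into the shared field ...
        self.shared_grad += grad
        # ... but only into its own slot in the per-context store.
        self.context_grads[context_id] += grad


ps = ToyParameterServer(weight=1.0)

# Two independent trainers finish their backward passes before either
# of them has had a chance to run its optimizer step.
ps.backward(context_id="trainer_a", grad=0.5)
ps.backward(context_id="trainer_b", grad=-2.0)

# The shared field now holds a mix of both trainers' gradients.
print(ps.shared_grad)                 # -1.5

# The per-context store still has each trainer's own gradient.
print(ps.context_grads["trainer_a"])  # 0.5
print(ps.context_grads["trainer_b"])  # -2.0
```

If trainer A ran its optimizer step using `shared_grad`, it would apply trainer B's gradient as well. In the real API, the context id returned by `with torch.distributed.autograd.context() as context_id:` plays the role of the dictionary key here.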

This is one example:

The idea is partially borrowed from Hogwild training:

BTW, could you please add a “distributed-rpc” tag for RPC-related questions so that the team can get back to you promptly, thanks!
