Hi,
I am trying to understand distributed autograd so I can use it correctly for my use case. Specifically, I'm confused about this part of the docs:
"""
Multiple nodes running distributed backward passes might accumulate gradients on the same tensor and as a result the .grad
field of the tensor would have gradients from a variety of distributed backward passes before we have the opportunity to run the optimizer. This is similar to calling torch.autograd.backward()
multiple times locally. In order to provide a way of separating out the gradients for each backward pass, the gradients are accumulated in the torch.distributed.autograd.context
for each backward pass.
"""
Why would multiple distributed backward passes be executing at the same time? Are they for different batches? All the examples seem to show an optimizer update at the end of processing each batch.