I am a bit confused about averaging gradients in distributed data-parallel. It seems there are two examples from the PyTorch documentation that are different. In one example, you create the model and just pass it to the GPU available then create a separate function to average gradients.
""" Gradient averaging. """
size = float(dist.get_world_size())
for param in model.parameters():
param.grad.data /= size
and its executed as follows.
outputs = model(inputs)
loss = loss_function(outputs, labels)
The other approach I have seen doesn’t create a separate function and just calls DPP.
model = ToyModel().cuda(device_ids)
ddp_model = DDP(model, device_ids)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
outputs = ddp_model(torch.randn(20, 10))
labels = torch.randn(20, 5).to(device_ids)
I would like to know whats the difference between the two approaches and which one should one use for distributed training in a HPC cluster. I specifically want to use two nodes, each with 4 GPUs.
Hey @ankahira, usually, there are 4 steps in distributed data parallel training:
- local forward to compute loss
- local backward to compute local gradients
- allreduce (communication) to compute global gradients. This would be allreduce with SUM + divide by world size to calculate average
- optimizer step to use global gradients to update parameters
Both examples you mentioned above conduct the same four steps and are mathematically equivalent. The difference is that DDP would allow step 2 (backward computation) and 3 (allreduce communication) to overlap and therefore DDP is expected to be faster than the
More specifically, in the first example with
average_gradients, there is a hard barrier between backward and allreduce, i.e., no comm can start before computation finishes. In the second example, DDP organizes gradients into buckets, and will launch comm as soon as a bucket of gradients are ready, so that computation and communication can run in parallel. This would help explain that.
I would recommend DDP.
Why would we need to divide the gradients by word size in step 3?
Because in many cases, the loss function computes per-sample loss (e.g., default
MSELoss) instead of aggregated loss (e.g., default
sum). So, DDP calculates the average and tries to keep the gradients scale consistent with local training.