I’m trying to get DistributedDataParallel to work on my code, using pytorch/fairseq as a reference implementation. I’m finding the implementation there difficult to comprehend, and I’ve opened an issue about it. Below is a (hopefully) complete relevant extract. I’ve already got the uncommented segment working, and the loss is converging.
def train_step(self, sample):
    sample = move_to(sample, self.device)
    loss, batch_sizes = self.model(sample)
    # 1: Is the below done implicitly?
    #    It seems to be missing in the fairseq code.
    # all-gather([loss, batch_sizes])
    # loss = loss.sum()/batch_sizes.sum()
    # 2: Something similar to the following
    #    exists. What is happening here?
    # for p in parameters-optimized:
    #     p.grad = p.grad*distributed_world_size/batch_sizes.sum()
My concerns are:
Shouldn’t I be doing an all-gather as indicated in the code? Or is this done implicitly?
Why would you need a gather on the loss? I can see how you might think the loss aggregation is needed for distributed training, but what happens is the following. Each process computes its own output, using its own input, with its own activations, and computes its own loss. Then, during loss.backward(), all processes reduce their gradients. By the time loss.backward() returns, the gradients of your model parameters are identical in every process, so the optimizer in each process performs the exact same update to the model parameters.
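To make that concrete, here is a minimal sketch in plain Python (no PyTorch, the "ranks" are just list entries). The toy model w * x, the squared-error loss, and the data are assumptions chosen for illustration. It checks that averaging the per-rank gradients gives the same number as differentiating the mean loss over all samples, provided every rank holds the same number of samples.

```python
# Toy model: prediction = w * x, per-sample loss = (w * x - t) ** 2.
# Plain Python stands in for PyTorch; nothing here is a real DDP call.

def local_mean_grad(w, batch):
    """d/dw of the mean squared error over one batch of (x, t) pairs."""
    return sum(2.0 * (w * x - t) * x for x, t in batch) / len(batch)

w = 0.5
# Hypothetical data: two ranks, each with its own batch of equal size.
rank_batches = [
    [(1.0, 2.0), (2.0, 3.0)],
    [(3.0, 1.0), (4.0, 5.0)],
]

# What each process computes on its own during backward.
local_grads = [local_mean_grad(w, b) for b in rank_batches]

# What a gradient all-reduce (average) leaves in every process.
averaged_grad = sum(local_grads) / len(local_grads)

# Reference: gradient of the mean loss over all samples on a single device.
all_samples = [s for b in rank_batches for s in b]
global_grad = local_mean_grad(w, all_samples)

print("local grads:  ", local_grads)
print("averaged grad:", averaged_grad)
print("global grad:  ", global_grad)
```

The local gradients differ, but the averaged gradient matches the single-device one, which is why no gather on the loss is needed for the update itself.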
As for the second snippet: it normalizes the gradients w.r.t. the total number of processes. If you end up using torch.nn.parallel.DistributedDataParallel, this is already done for you. It is possible this is still part of fairseq because earlier versions had a custom approach to distributed data parallelism, whereas newer versions can use the upstream wrapper directly (IIRC).
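One way to read the p.grad * distributed_world_size / batch_sizes.sum() line from the question, sketched in plain Python: assume (this is an assumption, not something confirmed from the fairseq source) that each rank's loss is a sum over its own samples and that the gradients have already been averaged over processes. Multiplying by world_size undoes that average, and dividing by the total sample count turns the summed gradient into a per-sample mean, even when ranks hold different numbers of samples. The toy model and data are hypothetical.

```python
def sum_grad(w, batch):
    """d/dw of the summed squared error (w * x - t) ** 2 over one batch."""
    return sum(2.0 * (w * x - t) * x for x, t in batch)

w = 0.5
# Hypothetical data: the two ranks hold different numbers of samples.
rank_batches = [
    [(1.0, 2.0)],
    [(2.0, 3.0), (3.0, 1.0), (4.0, 5.0)],
]
world_size = len(rank_batches)
total_samples = sum(len(b) for b in rank_batches)

# Gradient after averaging over processes (each rank used a summed loss).
ddp_grad = sum(sum_grad(w, b) for b in rank_batches) / world_size

# The rescaling from the question: grad * world_size / total batch size.
rescaled_grad = ddp_grad * world_size / total_samples

# Reference: gradient of the mean loss over every sample on one device.
all_samples = [s for b in rank_batches for s in b]
reference_grad = sum_grad(w, all_samples) / len(all_samples)

# For contrast: averaging per-rank *mean* gradients weights ranks, not samples.
naive_grad = sum(sum_grad(w, b) / len(b) for b in rank_batches) / world_size

print(rescaled_grad, reference_grad, naive_grad)
```

With unequal batch sizes the rescaled gradient matches the reference, while the plain average of per-rank means does not.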
Hi! I need some advice. I have 4 processes/GPUs with DDP. Should I implement loss reduction by sum (using all_reduce) before the backward pass, or is it enough for the gradients to be averaged automatically by DDP? Could increasing the learning rate by a factor of 4 compensate for the division by the number of GPUs done by the averaging? I am trying to get a DDP run equivalent to DataParallel.
There is a subtle difference between DP and DDP. IIUC, with DP, the grads from the replicated models are accumulated (i.e., summed) into the param.grad field of the original model, whereas DDP’s gradients are averaged. I’m not 100% confident, but if we want DDP to behave as similarly to DP as possible, we should probably multiply DDP’s resulting gradients by world_size. Whether that is the same as using a 4x learning rate might depend on the optimizer algorithm.
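A small sketch of why the optimizer matters, in plain Python with made-up numbers. For vanilla SGD the step is linear in the gradient, so summed gradients with learning rate lr give the same step as averaged gradients with lr * world_size. For an Adam-style update the gradient is divided by a running estimate of its own magnitude, so the step is nearly insensitive to gradient scale and a 4x learning rate gives a roughly 4x larger step instead. The adam_first_step helper below is a simplified first-step formula written for this illustration, not a library call.

```python
import math

world_size = 4
lr = 0.1
avg_grad = 0.2                      # hypothetical averaged gradient (DDP-style)
sum_grad = avg_grad * world_size    # the corresponding summed gradient

def sgd_step(grad, lr):
    """Parameter change for one vanilla SGD step."""
    return -lr * grad

def adam_first_step(grad, lr, eps=1e-8):
    """Parameter change for the first Adam step with bias correction.

    On step one the bias-corrected moments are m_hat = grad and
    v_hat = grad ** 2, so the update is -lr * grad / (|grad| + eps).
    """
    return -lr * grad / (math.sqrt(grad * grad) + eps)

# SGD: summed grads at lr match averaged grads at lr * world_size.
sgd_sum = sgd_step(sum_grad, lr)
sgd_avg_scaled_lr = sgd_step(avg_grad, lr * world_size)

# Adam-style: gradient scale cancels, so scaling lr changes the step size.
adam_sum = adam_first_step(sum_grad, lr)
adam_avg_scaled_lr = adam_first_step(avg_grad, lr * world_size)

print("SGD :", sgd_sum, sgd_avg_scaled_lr)
print("Adam:", adam_sum, adam_avg_scaled_lr)
```

Under these assumptions the two SGD steps agree, while the two Adam-style steps differ by about a factor of world_size.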
I am working with the FCOS loss. The authors of FCOS handle the DDP case by reducing the loss components inside the loss script. So I should remove that part of their code and not reduce the loss before backward. I will use reduction only for plotting the loss values (after backward) in the training script.
Is that OK in your opinion? Thanks again!
From the discussion above, I understand that the reason one shouldn’t do an all_gather sum of the losses when training in DistributedDataParallel mode is that these all_gather operations can slow down training.
Apart from performance, are there any other reasons why the loss tensors should not be summed?
I ask because, when the loss tensors are small, performing an all_gather sum while computing the losses results in identical losses for all processes. Gradient averaging over processes will then simply divide the losses by the number of processes.
This has the advantage of mimicking the behavior of DataParallel and of providing consistent results independently of the number of processes being run without the need to adjust learning rates, batch sizes, etc.
In short, when the cost of doing an all_gather sum of the losses is low, are there any reasons beyond performance not to do it? And isn’t consistent behavior, independent of the number of processes, an advantage?
Communicating the loss alone is not sufficient, because the gradient computation depends on both the loss and the activations. The activations depend on the input data, which is different in each process. Therefore, even if the loss is communicated, you still need to communicate either the gradients or the activations to make sure the model parameters stay consistent across all processes. Otherwise, if you only communicate the loss and then run backward locally, the models in different processes might diverge.
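The divergence can be seen in a toy simulation in plain Python. The assumptions here are a one-parameter model w * x with squared-error loss, hypothetical data, and a loss collective that shares only the loss value, so each rank still differentiates through its own activations. Every rank then sees the same loss number but takes a different step, whereas averaging the gradients keeps the parameters identical.

```python
def local_loss(w, batch):
    """Mean squared error of the toy model w * x on one rank's batch."""
    return sum((w * x - t) ** 2 for x, t in batch) / len(batch)

def local_grad(w, batch):
    """d/dw of local_loss; depends on this rank's inputs x (its activations)."""
    return sum(2.0 * (w * x - t) * x for x, t in batch) / len(batch)

w0 = 0.5
lr = 0.01
# Hypothetical data: each rank sees different inputs.
rank_batches = [
    [(1.0, 2.0), (2.0, 3.0)],
    [(3.0, 1.0), (4.0, 5.0)],
]
world_size = len(rank_batches)

# Case 1: only the loss value is communicated. All ranks agree on the number,
# but backward still runs through local activations only.
mean_loss = sum(local_loss(w0, b) for b in rank_batches) / world_size
shared_losses = [mean_loss for _ in rank_batches]
w_loss_only = [w0 - lr * local_grad(w0, b) for b in rank_batches]

# Case 2: gradients are averaged across ranks before the optimizer step.
avg_grad = sum(local_grad(w0, b) for b in rank_batches) / world_size
w_grad_sync = [w0 - lr * avg_grad for _ in rank_batches]

print("shared loss on every rank:", shared_losses)
print("weights, loss-only sync:  ", w_loss_only)
print("weights, gradient sync:   ", w_grad_sync)
```

After a single step the loss-only weights already differ between ranks, and the gap can grow with further steps since each rank continues from its own parameters.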