Comparison: Data Parallel vs. Distributed Data Parallel

What is the difference between reducing gradients and the weight update?

There are many weight-update algorithms, e.g., Adam, SGD, Adagrad, etc. (see more here). They are all independent of DP and DDP, so even with identical gradients, different optimizers can update the weights to different values.
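
For example, here is a tiny self-contained sketch (plain PyTorch, no DP/DDP involved) showing that the exact same gradient produces different updated values under SGD and Adam; the lr=0.1 is an arbitrary choice for illustration:

import torch

# Two identical copies of the same parameter, given the exact same gradient.
w_sgd = torch.nn.Parameter(torch.ones(3))
w_adam = torch.nn.Parameter(torch.ones(3))
grad = torch.tensor([0.5, -1.0, 2.0])
w_sgd.grad = grad.clone()
w_adam.grad = grad.clone()

torch.optim.SGD([w_sgd], lr=0.1).step()
torch.optim.Adam([w_adam], lr=0.1).step()

print(w_sgd.data)   # SGD: w - lr * grad
print(w_adam.data)  # Adam: roughly w - lr * sign(grad) on the first step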

Reducing gradients in DDP basically means communicating gradients across processes.
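
Concretely, the reduction is an all-reduce that sums the per-process gradients and divides by the world size. Below is a minimal sketch of that step done by hand, using the gloo backend on CPU with two processes (the port number is arbitrary); DDP performs the equivalent automatically inside backward():

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    torch.manual_seed(0)                  # same initial weights on every rank, as DDP ensures
    model = torch.nn.Linear(4, 1)

    torch.manual_seed(rank)               # different data on every rank
    model(torch.randn(2, 4)).sum().backward()

    # "Reduce gradients": sum .grad across processes, then average.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size

    print(f"rank {rank} averaged weight grad: {model.weight.grad}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)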

Do you mean that DP and DDP update exactly the same weights, with the same update for each layer, right?

Neither DP nor DDP touches the model weights. In the following code, it is opt.step() that updates the model weights. What DP and DDP do is prepare the .grad field of all parameters.

output = model(input)          # forward pass
output.sum().backward()        # fills .grad; this is where DP/DDP's gradient work happens
# DP and DDP are not involved below this point.
opt.step()                     # plain optimizer step: this is what updates the weights
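
A quick way to see this locally, without any DP/DDP at all: the weights are untouched by backward() and only change at opt.step(). A single-process sketch:

import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
before = model.weight.detach().clone()

model(torch.randn(2, 4)).sum().backward()     # only fills .grad, weights unchanged
print(torch.equal(model.weight, before))      # True
opt.step()                                    # this is what actually changes the weights
print(torch.equal(model.weight, before))      # False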

This is also confusing to me. Do you mean the batch size or the learning rate? You linked a thread about batch size.

I was quoting some discussion from that link. If you search for “lr”, you will find that almost all comments in that thread discuss how to configure the learning rate and batch size.

I find that there is no improvement when I use DDP with synchronized BN. That is why I am asking the third question.

Right, SyncBatchNorm has its own communication path, which is outside DDP's control. Using DDP won't change how SyncBatchNorm behaves.
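
To make that concrete: SyncBatchNorm is enabled by converting the BatchNorm layers themselves, and it exchanges batch statistics during the forward pass, while DDP only all-reduces gradients during the backward pass. A small sketch (the toy model and the commented-out rank in the DDP line are just placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# Replace every BatchNorm*d layer with SyncBatchNorm. In a multi-process run,
# SyncBatchNorm gathers batch statistics across processes on its own during
# forward(); DDP is not involved in that exchange.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(model)

# You would still wrap the converted model in DDP (after init_process_group)
# for gradient synchronization, e.g.:
# model = torch.nn.parallel.DistributedDataParallel(model.to(rank), device_ids=[rank])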
