Is averaging the correct way to combine gradients in DistributedDataParallel with multiple nodes?

Lausanne · February 10, 2019, 2:56am

@coincheung In torch.distributed mode, your lr should be 0.005.
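For context on the question in the title, here is a minimal sketch in plain Python (no PyTorch, so it is only an illustration and not what DistributedDataParallel runs internally). It assumes each worker computes a mean-reduced loss over an equally sized local batch; under that assumption, averaging the per-worker gradients gives the same gradient as one process working on the combined batch. The toy model `y = w * x`, the data, and the helper names `mean_grad` and `averaged_worker_grad` are all made up for the example. The reasoning behind the suggested lr value is not spelled out in this thread, so the sketch does not try to derive it.

```python
def mean_grad(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def averaged_worker_grad(w, shards):
    """Average the per-worker gradients, i.e. sum them and divide by the number of workers."""
    grads = [mean_grad(w, xs, ys) for xs, ys in shards]
    return sum(grads) / len(grads)


# Hypothetical data: one global batch of 4 samples, split evenly over 2 workers.
w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]

full_batch_grad = mean_grad(w, xs, ys)       # single process, batch of 4
averaged_grad = averaged_worker_grad(w, shards)  # 2 workers, batch of 2 each

print(full_batch_grad, averaged_grad)
```

If the local batches had different sizes, or the loss were summed rather than averaged per sample, the two values would generally differ. That case matters when deciding whether the learning rate needs to change as the number of processes grows.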