Is averaging the correct way to reduce gradients in DistributedDataParallel across multiple nodes?

@coincheung Your lr in torch.distributed mode should be 0.005
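
For context on why the learning rate depends on the reduction: DistributedDataParallel averages gradients across processes. If every process computes a mean loss over an equally sized shard, that average matches the gradient of the mean loss over the combined batch, whereas a sum would be `world_size` times larger. Here is a minimal sketch that simulates this in plain Python, with no torch and no real process group. The toy model `w * x`, the data, and `world_size = 4` are all made up for illustration and are not taken from the setup discussed here.

```python
# Pure-Python simulation of gradient reduction across ranks.
# Everything here (model, data, world_size) is a hypothetical toy setup.

def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

world_size = 4                   # assumed number of processes
shard = len(xs) // world_size    # equal shard size per rank

# Each simulated rank computes the gradient of its own local mean loss.
local_grads = [
    grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(world_size)
]

avg_grad = sum(local_grads) / world_size  # averaging reduction
sum_grad = sum(local_grads)               # summing reduction, for comparison
full_grad = grad_mse(w, xs, ys)           # single process, whole batch

print("averaged:", avg_grad)
print("summed:", sum_grad)
print("full batch:", full_grad)
```

With equal shards the averaged gradient equals the full-batch gradient, so the step size behaves as if one process had seen the whole combined batch. With a summed reduction the same `lr` would take a step `world_size` times larger. Whether a particular `lr` value is right still depends on the per-process batch size and the baseline being compared against.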