Is averaging the correct way to reduce the gradient in DistributedDataParallel with multiple nodes?

teng-li · January 9, 2019, 6:40pm

@GeoffreyChen777 Yes, averaging is the correct way to reduce gradients across nodes. The behavior you are seeing in DataParallel, where gradients are added together, is also correct.

The difference is that DataParallel splits the batch into sub-batches, one per GPU. When each GPU completes its computation, the gradients are reduced (added) onto the master GPU. Think about it this way: (1) this is a master-worker mode rather than true data parallelism, since only the master GPU scatters the batch and gathers the results; (2) we actually want the gradient of the total batch, which is why adding each worker's gradient is the expected behavior.

By comparison, DistributedDataParallel runs completely in parallel across distributed processes. If a process itself has more than one GPU, the same scatter-and-gather master-worker mode is used within that process, just as in DataParallel: gradients are added across the worker GPUs, and then averaged across the distributed processes. The bottom line is that gradients are averaged across data-parallel workers (processes), not across the worker GPUs within a single process.
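To make the arithmetic concrete, here is a small pure-Python sketch (no PyTorch involved) of why both reductions can yield the same full-batch gradient. It is only an illustration under stated assumptions: a toy scalar model with squared-error loss, a mean-reduced loss, and equal-sized shards. The function names (`dataparallel_style_grad`, `ddp_style_grad`, and so on) are made up for this example and are not PyTorch APIs.

```python
# Toy model: prediction = w * x, per-sample loss = (w * x - y) ** 2.
# All names here are hypothetical helpers for illustration only.

def per_sample_grad(w, x, y):
    # d/dw of (w * x - y) ** 2
    return 2.0 * (w * x - y) * x

def mean_grad(w, xs, ys):
    # Gradient of the mean loss over the given samples.
    return sum(per_sample_grad(w, x, y) for x, y in zip(xs, ys)) / len(xs)

def split(seq, k):
    # Assumption: len(seq) is divisible by k, so shards are equal-sized.
    size = len(seq) // k
    return [seq[i * size:(i + 1) * size] for i in range(k)]

def dataparallel_style_grad(w, xs, ys, num_gpus):
    # One process: the loss is averaged over the TOTAL batch, so each
    # replica's partial gradient is already scaled by 1 / total.
    # The partial gradients are then ADDED on the master.
    total = len(xs)
    partials = [
        sum(per_sample_grad(w, x, y) for x, y in zip(cx, cy)) / total
        for cx, cy in zip(split(xs, num_gpus), split(ys, num_gpus))
    ]
    return sum(partials)

def ddp_style_grad(w, xs, ys, num_procs):
    # Many processes: each one averages the loss over its LOCAL shard,
    # so the per-process gradients are then AVERAGED across processes.
    local = [
        mean_grad(w, cx, cy)
        for cx, cy in zip(split(xs, num_procs), split(ys, num_procs))
    ]
    return sum(local) / num_procs

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = mean_grad(w, xs, ys)
dp = dataparallel_style_grad(w, xs, ys, num_gpus=2)
ddp = ddp_style_grad(w, xs, ys, num_procs=2)
print(full, dp, ddp)
```

With equal-sized shards, all three values agree. If the processes held different numbers of samples, a plain average of the per-process gradients would no longer match the full-batch mean exactly.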