When I use DataParallel on one machine with two GPUs and a batch size of 8 (4 per GPU), I get satisfactory training results. But when I use DistributedDataParallel on two single-GPU machines with a batch size of 8 (4 per node), the training results are unsatisfactory and convergence is slower than with DataParallel.
After checking the docs for DataParallel and DistributedDataParallel, I noticed that DataParallel sums the gradients from each GPU, while DistributedDataParallel averages the gradients across nodes (one GPU per node in my case).
I think this difference is the reason for the different training results.
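To make the sum-versus-average difference concrete, here is a minimal plain-Python sketch (not PyTorch code, and not a description of how either wrapper is implemented). It assumes a toy scalar model with a squared-error loss, and it assumes each replica computes a mean-reduced loss over its own 4 samples; the data values and the helper names `sample_grad` and `mean_grad` are made up for illustration.

```python
import math

# Toy model: scalar weight w, per-sample loss (w*x - y)^2,
# so the per-sample gradient with respect to w is 2*x*(w*x - y).
def sample_grad(w, x, y):
    return 2.0 * x * (w * x - y)

# Gradient of the mean-reduced loss over a batch.
def mean_grad(w, batch):
    return sum(sample_grad(w, x, y) for x, y in batch) / len(batch)

w = 0.5
# Hypothetical global batch of 8 samples, split 4 + 4 across two replicas.
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0),
         (1.5, 2.5), (2.5, 4.0), (3.5, 6.0), (0.5, 1.0)]
shard_a, shard_b = batch[:4], batch[4:]

g_a = mean_grad(w, shard_a)  # gradient computed on replica A
g_b = mean_grad(w, shard_b)  # gradient computed on replica B

full_batch_grad = mean_grad(w, batch)  # single-device reference, batch of 8
averaged = (g_a + g_b) / 2             # replicas combined by averaging
summed = g_a + g_b                     # replicas combined by summing

print("averaging matches full batch:", math.isclose(averaged, full_batch_grad))
print("summing is 2x full batch:   ", math.isclose(summed, 2 * full_batch_grad))
```

Under these assumptions, averaging the per-replica gradients reproduces the gradient of the full batch of 8, while summing them gives a gradient twice as large. If the loss were sum-reduced instead of mean-reduced, the comparison would come out differently, so the loss reduction matters as much as how the gradients are combined.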
Is averaging the correct way to combine gradients in DistributedDataParallel with multiple nodes? Should I modify DistributedDataParallel to sum the gradients from each node in order to reproduce the same training results in my experiment?
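One thing that may be worth checking before modifying DistributedDataParallel: for a plain SGD update (no momentum, no weight decay), summing gradients is arithmetically the same as averaging them and multiplying the learning rate by the number of nodes. The sketch below is plain Python with made-up gradient values and a hypothetical `sgd_step` helper; it only illustrates that arithmetic and would not carry over directly to optimizers with other update rules.

```python
import math

# Plain SGD update for a single scalar weight (hypothetical helper).
def sgd_step(w, grad, lr):
    return w - lr * grad

world_size = 2
local_grads = [-22.0, -12.375]  # made-up per-node gradients for illustration
lr = 0.01
w0 = 0.5

summed = sum(local_grads)
averaged = summed / world_size

# Option 1: sum the gradients, keep the learning rate.
w_sum = sgd_step(w0, summed, lr)
# Option 2: average the gradients, scale the learning rate by world_size.
w_avg_scaled = sgd_step(w0, averaged, lr * world_size)

print("same update:", math.isclose(w_sum, w_avg_scaled))
```

So if summed gradients happen to train better in a given setup, a larger learning rate with averaged gradients might be a simpler experiment to try first than changing the reduction itself.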