Is Data Parallel(or DDP) equivalent to larger batch

rui_zhang_331 · December 2, 2020, 6:46am

Suppose I have batch size 256, and if I have 2 GPUs to use Data parallel, I can split a 512 batch data into two 256 batches, but in final optimization, It uses sum up of individual loss gradients which equals to gradient of loss sum up. Ex: (f(x1) + f(x2))’ = f’(x1) + f’(x2) . dose that mean it is same as a single 512 batch size?

Since our model may not perform well in large batch size, but we have large training data, so would like to know whether we should consider DP or DDP

PistonY · December 2, 2020, 6:55am

You can regard DP as a single 512 batch size, but DDP is different.

rui_zhang_331 · December 2, 2020, 7:22am

can you elaborate more on why DDP is different? they still use All reduce to “average” gradient?

PistonY · December 2, 2020, 7:32am

I think this should be helpful.

rui_zhang_331 · December 2, 2020, 7:55am

Yes, I checked this thread, I knew DDP is better than DP as it dose fully All reduce and multi processing, instead DP need to calculate the loss in main GPU thus in balanced GPU utilization.

But here my question is if # of GPU is equivalent to batch_size * # in DP , is it same in DDP? since in DDP “At the end of the backwards pass, every node has the averaged gradients” , so to me seems both are doing some “average” here

PistonY · December 2, 2020, 8:06am

No, they are different. DDP will calculate loss on each device and get grad then do “average”, DP just do it in “main” device and update model.

rui_zhang_331 · December 2, 2020, 4:37pm

Thanks, I understand, but why DP is same as large batch size, but DDP is not? essentially they do some “average”, “sum” for all gradient calculated in each GPUs, so model side should be same?

PistonY · December 5, 2020, 12:17pm

DP will all gather NN result to main device and then all operations are doing in it, that may contain calculate loss get grad and update model param.
DDP keep result on each device until get grad, and do you say average then update model on each device.