Is Data Parallel (or DDP) equivalent to a larger batch size?

Suppose I have a batch size of 256. If I have 2 GPUs and use Data Parallel, I can split a 512-sample batch into two 256-sample batches, and in the final optimization step it uses the sum of the individual loss gradients, which equals the gradient of the summed losses, e.g. (f(x1) + f(x2))' = f'(x1) + f'(x2). Does that mean it is the same as a single batch size of 512?
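For what it's worth, here is a minimal autograd sketch (a made-up toy scalar function f, not a real model) confirming the linearity you describe: the gradient of the summed losses equals the sum of the per-sample gradients.

```python
import torch

# Hypothetical toy function f(x) = ((w . x))^2, just to illustrate the identity.
torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)

def f(x):
    return (w * x).sum() ** 2

x1 = torch.randn(4)
x2 = torch.randn(4)

# Gradient of the summed loss ...
grad_of_sum = torch.autograd.grad(f(x1) + f(x2), w)[0]

# ... versus the sum of the individual gradients.
g1 = torch.autograd.grad(f(x1), w)[0]
g2 = torch.autograd.grad(f(x2), w)[0]

print(torch.allclose(grad_of_sum, g1 + g2))  # True
```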

Our model may not perform well with a large batch size, but we have a lot of training data, so we would like to know whether we should consider DP or DDP.

You can regard DP as equivalent to a single batch size of 512, but DDP is different.
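To make the DP mechanism concrete, here is a minimal nn.DataParallel sketch (toy model and sizes are hypothetical): the full 512-sample batch is fed to the wrapper, which scatters 256 samples to each GPU, gathers the outputs back on the main device, and the loss, backward, and optimizer step all happen there, just as with a plain 512 batch.

```python
import torch
import torch.nn as nn

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Hypothetical toy model and batch, just to show where DP computes the loss.
model = nn.DataParallel(nn.Linear(10, 1).to(device))   # replicates across visible GPUs
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

x = torch.randn(512, 10, device=device)   # the full 512 batch lives on the main device
y = torch.randn(512, 1, device=device)

out = model(x)            # scattered 256 per GPU, outputs gathered back on the main device
loss = criterion(out, y)  # one loss over all 512 samples, computed on the main device
loss.backward()           # gradients reduced onto the main device's parameter copy
optimizer.step()          # a single update, exactly like a plain 512 batch
```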

Can you elaborate on why DDP is different? Doesn't it still use all-reduce to "average" the gradients?

I think this should be helpful.

Yes, I checked that thread, and I know DDP is better than DP because it does a full all-reduce with multiprocessing, whereas DP has to calculate the loss on the main GPU, which leads to unbalanced GPU utilization.

But my question here is: if using N GPUs in DP is equivalent to multiplying the batch size by N, is the same true in DDP? Since in DDP "At the end of the backwards pass, every node has the averaged gradients", it seems to me that both are doing some kind of "average" here.

No, they are different. DDP calculates the loss on each device, gets the gradients, and then does the "average"; DP does everything on the "main" device and updates the model there.

Thanks, I understand that, but why is DP the same as a large batch size while DDP is not? Essentially they both do some "average" or "sum" over the gradients calculated on each GPU, so from the model's point of view shouldn't they be the same?

DP gathers the network outputs onto the main device, and then all operations happen there, including calculating the loss, getting the gradients, and updating the model parameters.
DDP keeps the results on each device until the gradients are computed, then does the averaging you mention and updates the model on each device.
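For comparison, here is a rough DDP sketch (hypothetical toy model; it assumes a single-node launch with torchrun so that the global rank equals the local GPU index): each process builds its own 256-sample shard, computes its own loss and gradients locally, DDP all-reduces (averages) the gradients during backward, and then every process applies the same update to its own replica.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; rank and world size come from the launcher (e.g. torchrun).
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Hypothetical toy model; each process holds a full replica.
    model = DDP(nn.Linear(10, 1).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.MSELoss()

    # Each rank gets its own 256-sample shard (normally via a DistributedSampler).
    x = torch.randn(256, 10, device=rank)
    y = torch.randn(256, 1, device=rank)

    out = model(x)
    loss = criterion(out, y)   # loss computed locally on this rank's shard
    loss.backward()            # DDP all-reduces (averages) gradients across ranks
    optimizer.step()           # every rank applies the same averaged update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```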