Suppose I have a batch size of 256 and 2 GPUs to use with DataParallel. I can split a 512-sample batch into two 256-sample batches, and in the final optimization step it uses the sum of the individual loss gradients, which equals the gradient of the summed loss, e.g. (f(x1) + f(x2))' = f'(x1) + f'(x2). Does that mean it is the same as a single batch size of 512?
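A quick pure-Python sketch of the identity above (no GPUs or torch needed; a hypothetical 1-parameter linear model with squared error, just for illustration):

```python
# Per-sample loss for a toy 1-parameter model: L_i(w) = (w * x_i - y_i)^2
# Analytic gradient: dL_i/dw = 2 * (w * x_i - y_i) * x_i

def grad_single(w, x, y):
    """Gradient of one sample's squared error w.r.t. w."""
    return 2 * (w * x - y) * x

def grad_summed(w, xs, ys):
    """Gradient of the summed loss over a batch, computed in one shot."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]   # pretend batch of 4
ys = [1.0, 2.0, 3.0, 4.0]

# "Split across 2 GPUs": sum the gradients from two half-batches
g_split = (sum(grad_single(w, x, y) for x, y in zip(xs[:2], ys[:2]))
           + sum(grad_single(w, x, y) for x, y in zip(xs[2:], ys[2:])))

# Single big batch: gradient of the total loss
g_full = grad_summed(w, xs, ys)

print(g_split, g_full)  # identical: sum of gradients == gradient of sum
```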

Since our model may not perform well with a large batch size, but we have a lot of training data, I would like to know whether we should consider DP or DDP.

Yes, I checked that thread. I know DDP is better than DP, since it does a full all-reduce with multiprocessing, whereas DP needs to compute the loss on the main GPU, which leads to imbalanced GPU utilization.

But my question here is: if in DP the effective batch size is batch_size * (# of GPUs), is the same true in DDP? In DDP, "At the end of the backwards pass, every node has the averaged gradients", so to me it seems both are doing some kind of "average" here.

Thanks, I understand, but why would DP be the same as a large batch size while DDP is not? Essentially both "average" or "sum" the gradients computed on each GPU, so from the model's point of view shouldn't they be the same?
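For what it's worth, here is a pure-Python sketch of the DDP-style "average" (same hypothetical 1-parameter model as above, assuming a mean-reduced loss and equal shard sizes): averaging per-GPU gradients gives the same result as the gradient of the mean loss over the full batch.

```python
# Toy model: L_i(w) = (w * x_i - y_i)^2, mean-reduced over each shard.
# With equal shard sizes, averaging per-shard gradients (DDP's all-reduce
# average) equals the gradient of the mean loss over the full batch.

def mean_grad(w, xs, ys):
    """Gradient of the mean squared error over one shard."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]

# Two "GPUs", each holding half the batch
g0 = mean_grad(w, xs[:2], ys[:2])
g1 = mean_grad(w, xs[2:], ys[2:])
g_ddp = (g0 + g1) / 2          # simulated all-reduce average

g_big = mean_grad(w, xs, ys)   # one process, full batch

print(g_ddp, g_big)  # equal when shards are the same size
```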

DP gathers the network outputs onto the main device and then performs all the remaining operations there: computing the loss, getting the gradients, and updating the model parameters.
DDP keeps the outputs on each device through the backward pass, then, as you say, averages the gradients (all-reduce) and updates the model on each device.
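The two flows above can be simulated side by side in plain Python (hypothetical model y_hat = w * x with mean squared error and one SGD step; shard contents and learning rate are made up for illustration). With a mean-reduced loss and equal shards, both flows produce the same parameter update:

```python
# Toy single-parameter simulation of the DP and DDP flows described above.

LR = 0.1  # arbitrary learning rate for the demo

def per_sample_grad(w, x, y):
    """d/dw of (w*x - y)^2 for one sample."""
    return 2 * (w * x - y) * x

def dp_step(w, shards):
    # DP: outputs are gathered to device 0; loss, gradient, and the
    # parameter update all happen there.
    all_pairs = [p for shard in shards for p in shard]              # gather
    g = sum(per_sample_grad(w, x, y) for x, y in all_pairs) / len(all_pairs)
    return w - LR * g                                               # single update on device 0

def ddp_step(w, shards):
    # DDP: each device computes its local mean-loss gradient, the
    # all-reduce averages them, and every device applies the same update.
    local = [sum(per_sample_grad(w, x, y) for x, y in s) / len(s) for s in shards]
    g = sum(local) / len(local)                                     # all-reduce average
    return w - LR * g                                               # identical on every device

shards = [[(1.0, 2.0), (2.0, 1.0)], [(3.0, 4.0), (4.0, 3.0)]]
w_dp = dp_step(0.5, shards)
w_ddp = ddp_step(0.5, shards)
print(w_dp, w_ddp)  # same new parameter value
```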