If we split a batch of data across multiple GPUs, will the result be mathematically equivalent (in terms of learned weights) to using the whole batch on a single GPU?
Here are the assumptions for both experiments:
- using same seed
- no batch norm
- exactly same data in the batch
- for the sake of simplicity, let's assume we are doing just a single iteration step
- keeping the learning rate constant at 1
I have read in multiple places that they should be mathematically equal, but from my calculation below I am not able to prove it.
Let the model be F(X) = w * X,
the loss be Loss(X, F) = sum(F(X) - Const),
and the aggregation function for multi-GPU gradient synchronization be the mean.
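To pin down what this loss means in code, here is a minimal PyTorch sketch of the toy model and its sum-reduced loss (the values of `w` and `Const` are placeholders I picked for illustration):

```python
import torch

w = torch.tensor(3.0, requires_grad=True)   # placeholder weight
Const = 1.0                                 # placeholder constant

def F(X):
    # F(X) = w * X, applied elementwise over the batch
    return w * X

def loss_fn(X):
    # Loss(X, F) = sum(F(X) - Const): note the *sum* (not mean) reduction
    return (F(X) - Const).sum()
```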
Experiment 1: [batch_size=2, num_of_gpu=1]
Input: X = [x1, x2]
loss = w*(x1 + x2) - 2*Const
grad_single_gpu = grad(loss, w) = d(loss)/dw = x1+x2
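A quick autograd check of this gradient, assuming the same toy model and placeholder inputs x1 = 2, x2 = 5:

```python
import torch

x1, x2 = 2.0, 5.0                        # placeholder inputs
Const = 1.0                              # placeholder constant
w = torch.tensor(3.0, requires_grad=True)

X = torch.tensor([x1, x2])
loss = (w * X - Const).sum()             # = w*(x1 + x2) - 2*Const
loss.backward()
print(w.grad)                            # tensor(7.) == x1 + x2
```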
Experiment 2: [batch_size=1, num_of_gpu=2]
GPU_0
Input: X = [x1]
loss = (w * x1) - Const
grad_gpu_0 = grad(loss, w) = d(loss)/dw = x1
GPU_1
Input: X = [x2]
loss = (w * x2) - Const
grad_gpu_1 = grad(loss, w) = d(loss)/dw = x2
Now, gradient synchronization will average the gradients from both GPUs, so the effective gradient = (grad_gpu_0 + grad_gpu_1) / 2 = (x1 + x2)/2, which is not equal to Experiment 1's gradient of (x1 + x2).
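The two-GPU case can be simulated in a single process by computing each GPU's gradient on its own sample and then taking the mean, which is what an all-reduce-with-average would do here (same placeholder values as above; this is only a simulation, not actual torch.distributed code):

```python
import torch

x1, x2 = 2.0, 5.0    # placeholder inputs
Const = 1.0          # placeholder constant

def per_gpu_grad(x):
    # Each simulated "GPU" sees one sample and computes its local gradient.
    w = torch.tensor(3.0, requires_grad=True)
    loss = w * x - Const
    loss.backward()
    return w.grad

grad_gpu_0 = per_gpu_grad(x1)                     # x1
grad_gpu_1 = per_gpu_grad(x2)                     # x2
effective_grad = (grad_gpu_0 + grad_gpu_1) / 2    # mean aggregation
print(effective_grad)                             # tensor(3.5) == (x1 + x2) / 2
```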
Am I missing something?
Moreover, in real training, batches are selected at random, models are non-linear, and the learning rate follows a cosine-annealing schedule. Wouldn't these factors create even larger differences?