Theoretical equivalence of weights between single-GPU and multi-GPU training

If we split a batch of data across multiple GPUs, will it be mathematically equivalent (in terms of the learned weights) to using the whole batch on a single GPU?

The following are the assumptions for both experiments:

  1. using the same seed
  2. no batch norm
  3. exactly the same data in the batch
  4. for the sake of simplicity, assume we perform just a single iteration step
  5. keeping the learning rate constant at 1

I have read in multiple places that they should be mathematically equal, but from my calculation below I am not able to prove it.

Let the model be F(X) = w * X,
the loss be Loss(X, F) = sum(F(X) - Const),
and the aggregation function for multi-GPU synchronization be the mean.

Experiment 1: [batch_size=2, num_of_gpu=1]
Input: X = [x1, x2]
loss = w*(x1 + x2) - 2*Const
grad_single_gpu = grad(loss, w) = d(loss)/dw = x1+x2
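
For reference, here is a minimal PyTorch sketch that reproduces this gradient via autograd; x1, x2 and Const are placeholder values I picked for illustration, not anything specific:

```python
import torch

# Placeholder values, chosen only to check the math.
x1, x2, Const = 2.0, 3.0, 1.0
w = torch.tensor(1.0, requires_grad=True)

X = torch.tensor([x1, x2])
loss = torch.sum(w * X - Const)  # sum reduction, matching Loss(X, F) above
loss.backward()

print(w.grad)  # tensor(5.) == x1 + x2
```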

Experiment 2: [batch_size=1, num_of_gpu=2]
GPU_0
Input: X = [x1]
loss = (w * x1) - Const
grad_gpu_0 = grad(loss, w) = d(loss)/dw = x1

GPU_1
Input: X = [x2]
loss = (w * x2) - Const
grad_gpu_1 = grad(loss, w) = d(loss)/dw = x2

Now, gradient synchronization will average the gradients from both GPUs, so the effective gradient = (grad_gpu_0 + grad_gpu_1) / 2 = (x1 + x2) / 2, which is not equivalent to Experiment 1's (x1 + x2).
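
The same mismatch shows up if I simulate the two-GPU setup on one machine, computing each per-sample gradient separately and averaging them the way gradient synchronization would (same placeholder values as above):

```python
import torch

# Placeholder values, same as in the single-GPU sketch.
x1, x2, Const = 2.0, 3.0, 1.0

def per_sample_grad(x):
    # Gradient of the loss for a "GPU" holding a batch of one sample.
    w = torch.tensor(1.0, requires_grad=True)
    loss = w * x - Const
    loss.backward()
    return w.grad

grad_gpu_0 = per_sample_grad(x1)                # == x1
grad_gpu_1 = per_sample_grad(x2)                # == x2
effective_grad = (grad_gpu_0 + grad_gpu_1) / 2  # averaged during sync

print(effective_grad)  # tensor(2.5000) == (x1 + x2) / 2, not x1 + x2
```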

Am I missing something?

Moreover, in real training the batches are sampled at random, the models are non-linear, and the learning rate follows a cosine-annealing schedule. Wouldn't these factors create even larger differences?

Yes, the default loss reduction computes the mean: the per-sample losses are summed and divided by the batch size. Your example uses a sum reduction instead; with the default mean reduction, the single-GPU gradient would also be (x1 + x2) / 2, matching the averaged multi-GPU gradient. Assuming the batch size is equal on all GPUs, the result will be the same.
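
As a quick sketch of this point (reusing the placeholder values from the snippets in the question), switching the reduction from sum to mean makes the single-GPU gradient match the averaged multi-GPU gradient:

```python
import torch

# Same placeholder values as above. Note that most built-in PyTorch losses
# (e.g. nn.MSELoss) default to reduction='mean'.
x1, x2, Const = 2.0, 3.0, 1.0
w = torch.tensor(1.0, requires_grad=True)

X = torch.tensor([x1, x2])
loss = torch.mean(w * X - Const)  # mean reduction instead of sum
loss.backward()

print(w.grad)  # tensor(2.5000) == (x1 + x2) / 2, matching the 2-GPU result
```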


Got it, thanks. Does the same apply to multi-node training as well?