DataParallel results in a different network compared to a single GPU run

Thanks a lot for the clarification. Yeah, that makes sense, because like you say, it’s not k “regular” batch updates; we essentially aggregate the results from the k GPUs before we apply the update.

On the other hand, if I choose a 4-times larger batch size compared to the single-GPU version, I thought the results would be the same, because each GPU would compute its predictions separately, and the default device would then combine the outputs of the sub-batches to compute the loss and the gradient. (The downside would then be that the backward pass couldn’t efficiently leverage parallelism.)
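To make that concrete, here’s a minimal sketch of what I mean (the model, tensor sizes, and names are made up): the per-GPU outputs are gathered onto the default device, and the loss is computed there over the whole batch, just like in a single-GPU run.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: a small model and a batch that DataParallel
# will split across the visible GPUs (e.g. 4 sub-batches of 32 on 4 GPUs).
model = nn.Linear(10, 2).cuda()
dp_model = nn.DataParallel(model)            # replicates onto all visible GPUs
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(128, 10).cuda()         # full batch on the default device
targets = torch.randint(0, 2, (128,)).cuda()

# forward: inputs are scattered, outputs are gathered back onto the
# default device, so the loss is computed over the whole batch there
outputs = dp_model(inputs)                   # shape (128, 2), on the default device
loss = criterion(outputs, targets)

# backward: grad_output is scattered, each replica runs its backward pass,
# and the replicas' gradients are reduced onto the original model
loss.backward()
```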

From the post you linked (Debugging DataParallel, no speedup and uneven memory allocation - #13 by ngimel):

in forward:

  • scatter mini-batch to GPU1, GPU2
  • replicate model on GPU2 (it is already on GPU1)
  • model_gpu1(input_gpu1), model_gpu2(input_gpu2) (this step is parallel_apply)
  • gather output mini-batch from GPU1, GPU2 onto GPU1

in backward:

  • scatter grad_output and input
  • parallel_apply model’s backward pass
  • reduce GPU2 replica’s gradients onto GPU1 model
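
For reference, the forward steps above can be reproduced by hand with the primitives DataParallel uses internally (torch.nn.parallel.scatter/replicate/parallel_apply/gather). This is just a rough sketch assuming two GPUs; the device ids and shapes are made up.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

device_ids = [0, 1]
module = nn.Linear(10, 2).to("cuda:0")       # the model lives on GPU1 (cuda:0)
batch = torch.randn(64, 10, device="cuda:0")

# 1. scatter the mini-batch: 32 samples go to cuda:0, 32 to cuda:1
inputs = scatter(batch, device_ids)

# 2. replicate the model (cuda:0 already has it, cuda:1 gets a copy)
replicas = replicate(module, device_ids[: len(inputs)])

# 3. run the replicas on their sub-batches in parallel
outputs = parallel_apply(replicas, inputs)

# 4. gather the per-GPU outputs back onto cuda:0
output = gather(outputs, target_device=0)    # shape (64, 2), on cuda:0
```

With nn.DataParallel these four calls happen inside the module’s forward, so user code only ever sees the single gathered output tensor on the default device.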

Why do we need to “gather output mini-batch from GPU1, GPU2 onto GPU1” as the last step of the forward pass? If each GPU computes its own gradients (based on its scattered sub-batch), this step shouldn’t be necessary, right?
