I notice that I get different results when I use DataParallel compared to a single-GPU run. When I execute the same code on 1 GPU, I get the same loss if I repeat the procedure (assuming a fixed random seed). For some reason, though, the loss is ~5% different if I use DataParallel. Is that a bug, or is it a design consequence of summing up the gradients?
I think the difference might come from the reduction of the per-device gradients after the backward pass.
Here is an overview of the DataParallel algorithm.
Unfortunately, I can’t test it right now, but from my understanding, after the parallel_apply of the model’s backward pass, the gradients will be reduced and thus accumulated on the default device.
In a general sense, you are using a larger effective batch size for the same model, which might also change the training behavior. While the parameters of the model are usually updated after each batch iteration, here the same parameters are used for multiple sub-batches (one on each GPU) before the update.
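For illustration, here is a minimal sketch of that behavior (the toy model, the batch size of 64, and the two device IDs are assumptions for the example, not from your code):

```python
import torch
import torch.nn as nn

# Toy model wrapped in DataParallel (assumes 2 visible GPUs).
model = nn.Linear(10, 2).cuda()
dp_model = nn.DataParallel(model, device_ids=[0, 1])

criterion = nn.CrossEntropyLoss()
x = torch.randn(64, 10).cuda()           # the full batch starts on the default device
y = torch.randint(0, 2, (64,)).cuda()

out = dp_model(x)         # forward: scatter (32 samples per GPU), replicate, parallel_apply, gather
loss = criterion(out, y)  # loss is computed on the gathered output on the default device
loss.backward()           # backward: the replica gradients are reduced onto the default device

# The same parameters were used for both 32-sample chunks before this single update.
print(model.weight.grad.device)  # -> cuda:0
```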
Let me know if that makes sense or if I’m completely mistaken.
As I can’t test it right now, please take this statement with a grain of salt!
Thanks a lot for the clarification. Yeah, that makes sense: as you say, it’s not like k “regular” batch updates, but we essentially aggregate the results from the k GPUs before we apply the update.
On the other hand, I thought that if I choose a 4-times larger batch size compared to the 1-GPU version, the results would be the same, because each GPU would compute its predictions separately and the default device would then combine the outputs of the sub-batches to compute the loss and the gradient (the downside being that the backward pass couldn’t leverage parallelism as efficiently).
From the post you link (Debugging DataParallel, no speedup and uneven memory allocation - #13 by ngimel),
in forward:
- scatter mini-batch to GPU1, GPU2
- replicate model on GPU2 (it is already on GPU1)
- model_gpu1(input_gpu1), model_gpu2(input_gpu2) (this step is parallel_apply)
- gather output mini-batch from GPU1, GPU2 onto GPU1
in backward:
- scatter grad_output and input
- parallel_apply model’s backward pass
- reduce GPU2 replica’s gradients onto GPU1 model
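As a rough sketch of the forward steps above, using the functional helpers in torch.nn.parallel (the toy model, batch size, and device IDs are assumptions for illustration, not from the linked post):

```python
import torch
from torch import nn
from torch.nn.parallel import scatter, replicate, parallel_apply, gather

device_ids = [0, 1]
model = nn.Linear(10, 2).to(device_ids[0])
inputs = torch.randn(64, 10, device=device_ids[0])

scattered = scatter(inputs, device_ids)       # 1. scatter mini-batch to GPU0, GPU1
replicas = replicate(model, device_ids)       # 2. replicate model on GPU1 (it is already on GPU0)
outputs = parallel_apply(replicas, scattered) # 3. run the replicas on their chunks in parallel
output = gather(outputs, device_ids[0])       # 4. gather output mini-batch onto GPU0

# The backward pass (triggered by loss.backward()) then scatters grad_output,
# runs the replica backwards in parallel, and reduces the gradients onto the GPU0 model.
```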
Why do we need to “gather output mini-batch from GPU1, GPU2 onto GPU1” as the last step of the forward pass? If each GPU has its own gradient (computed based on the scattered mini-batch), this shouldn’t be necessary, right?
The reason for this is the loss calculation, which will take place on the default device.
As a small side note, this is also why you see a bit more memory consumption on that device.
You could skip it by adding your loss directly into your model and calculating the losses on each replica.
Then only the (scalar) loss values have to be reduced.
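A sketch of that approach (the wrapper class and toy model are hypothetical, just to show the pattern): the loss is computed inside forward(), so each replica only returns a scalar and the full output tensor never has to be gathered.

```python
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Computes the loss inside forward(), so each replica returns a scalar."""
    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, x, target):
        out = self.model(x)
        return self.criterion(out, target)

model = nn.Linear(10, 2)
wrapper = nn.DataParallel(ModelWithLoss(model, nn.CrossEntropyLoss()).cuda())

x = torch.randn(64, 10).cuda()
y = torch.randint(0, 2, (64,)).cuda()

losses = wrapper(x, y)  # one scalar loss per replica, gathered onto the default device
loss = losses.mean()    # combine the per-replica losses before backward
loss.backward()
```

With the default `reduction='mean'` and equal chunk sizes, the mean of the per-replica losses equals the full-batch mean loss.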
Ohhh, I see now. This is basically because the labels sit on the first device. It’s not a design choice but rather a consequence of where I put the loss function (or rather its inputs, i.e. the labels). Makes so much sense now. Thanks!
Just wanted to leave a note regarding BatchNorm: the running stats might also differ between multi-GPU and single-GPU usage, since each replica computes its batch statistics from a smaller sub-batch.
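A minimal, CPU-only sketch of why that happens (the split into 32-sample chunks is just a stand-in for what each GPU would see; it doesn’t use DataParallel itself):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 16)

bn_full = nn.BatchNorm1d(16)
bn_chunk = nn.BatchNorm1d(16)

bn_full(x)        # running stats updated from all 64 samples
bn_chunk(x[:32])  # running stats updated from a 32-sample chunk, as one replica would see

# The exponential moving averages are based on different batch statistics.
print(torch.allclose(bn_full.running_mean, bn_chunk.running_mean))  # likely False
```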