DataParallel results in a different network compared to a single GPU run

I think the difference might come from the reduction of the per-device gradients after the backward pass.
Here is an overview of the DataParallel algorithm.
Unfortunately, I can’t test it right now, but from my understanding, after the parallel_apply of the model’s backward pass the gradients are reduced and thus accumulated on the default device.
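
If it helps, here is a minimal, untested sketch of what I mean (the small nn.Linear model is just a placeholder), which checks on which device the reduced gradients end up:

```python
# Sketch (untested here): wrap a small model in nn.DataParallel and check
# where the reduced gradients end up after backward.
import torch
import torch.nn as nn

model = nn.Linear(10, 2).cuda()           # parameters live on cuda:0
dp_model = nn.DataParallel(model)         # replicas are created on all visible GPUs

x = torch.randn(64, 10, device='cuda:0')  # the batch gets scattered across the GPUs
out = dp_model(x)                         # parallel_apply of the forward pass
out.sum().backward()                      # per-GPU gradients are reduced to the default device

print(model.weight.grad.device)           # expected: cuda:0
```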

More generally, you are using a larger effective batch size for the same model, which might also change the training behavior. While the parameters are usually updated after each batch iteration, with DataParallel the same parameters process multiple chunks of the batch (one per GPU) before a single update.
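
To make the comparison fairer, you could keep the number of samples per optimizer step identical in both runs. A rough sketch of the bookkeeping (the variable names are just for illustration):

```python
# Sketch: keep the global batch size per optimizer step identical when
# comparing a single-GPU run against a DataParallel run.
import torch

num_gpus = max(torch.cuda.device_count(), 1)
per_gpu_batch = 32                         # assumed single-GPU batch size

# DataParallel scatters the input batch, so each replica sees a chunk of it,
# but the gradients of the whole batch contribute to one parameter update.
global_batch = per_gpu_batch * num_gpus

print(f"{num_gpus} GPU(s): {global_batch} samples contribute to each update")
```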

Let me know if that makes sense or if I’m completely mistaken.

As I can’t test it right now, please take this statement with a grain of salt! :wink:
