I trained a model with PyTorch 1.2 on a distributed GPU system, and the training steps on each node are out of sync. For example, node 1 of the distributed system has already finished training epoch N, while node 2 has just started epoch N-1. I wonder whether this out-of-sync behavior could negatively affect model training. I would appreciate it if anyone could give me some suggestions.
How did you create the distributed training procedure?
Is it a manual approach, or did you use DistributedDataParallel (DDP)?
In the latter case, synchronization points are automatically inserted into the training process.
From the docs:
Across processes, DDP inserts necessary parameter synchronizations in forward passes and gradient synchronizations in backward passes.
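To make those synchronization points concrete, here is a minimal sketch of a DDP training step (the function name `train` and the toy model are illustrative, and it uses the `gloo` backend on CPU so it can run anywhere; on GPUs you would use `nccl` and move the model to the local device). The key point is that `loss.backward()` performs a gradient all-reduce across ranks, so no rank can run an epoch ahead of the others:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int) -> float:
    # Each process joins the same process group; DDP then keeps the
    # replicas in lockstep: parameters are broadcast from rank 0 at
    # construction, and gradients are all-reduced during backward().
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Toy model for illustration; with GPUs, use backend="nccl"
    # and wrap model.to(rank) instead.
    model = DDP(nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    loss = torch.tensor(0.0)
    for _ in range(3):
        opt.zero_grad()
        loss = model(torch.randn(8, 10)).pow(2).mean()
        loss.backward()  # gradient all-reduce: every rank blocks here
        opt.step()

    dist.destroy_process_group()
    return loss.item()
```

Because every rank blocks in `backward()` until the all-reduce completes, genuinely out-of-sync epochs should not happen under DDP; if they do, it usually means the processes are not actually in the same process group.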
Thanks for your reply. I am using the latter approach, DDP. Based on your answer, the out-of-sync behavior must be a problem in our GPU distributed system, and I will need some time to track it down.