Training on a Distributed GPU System with PyTorch: Node Asynchronization Problem

Cooper_Lee · October 14, 2019, 3:43am

I trained a model with PyTorch 1.2 on a distributed GPU system, and the training steps on the nodes are out of sync. For example, node 1 of the distributed system has already finished training on epoch N, while node 2 has just started training step N-1. I wonder whether this out-of-sync behavior would have a negative effect on model training. I would appreciate any suggestions.

ptrblck · October 14, 2019, 5:27am

How did you create the distributed training procedure?
Is it a manual approach or did you use nn.parallel.DistributedDataParallel (DDP)?
In the latter case, synchronization points are automatically added to the training process.
From the docs:

Across processes, DDP inserts necessary parameter synchronizations in forward passes and gradient synchronizations in backward passes.
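
To illustrate why such synchronization points keep processes from drifting far apart, here is a toy simulation in plain Python. It is only a sketch and not DDP itself: threads stand in for ranks, a `threading.Barrier` stands in for the gradient all-reduce, and the names `run_workers` and `delays` are made up for this example.

```python
import threading
import time


def run_workers(world_size=2, steps=5, delays=(0.0, 0.01)):
    """Toy model of synchronous data-parallel training (NOT real DDP).

    Each thread plays one rank. The barrier is an assumed stand-in for the
    gradient all-reduce that happens once per training step.
    """
    barrier = threading.Barrier(world_size)
    lock = threading.Lock()
    current_step = [0] * world_size  # step each rank is currently working on
    grads = [[0.0] * world_size for _ in range(steps)]      # local gradients
    averaged = [[None] * world_size for _ in range(steps)]  # result seen by each rank
    max_gap = 0  # largest step difference observed between any two ranks

    def worker(rank):
        nonlocal max_gap
        for step in range(steps):
            with lock:
                current_step[rank] = step
                max_gap = max(max_gap, max(current_step) - min(current_step))
            time.sleep(delays[rank % len(delays)])  # ranks compute at different speeds
            grads[step][rank] = float(rank + 1)     # fake local gradient
            barrier.wait()                          # nobody continues until all ranks arrive
            averaged[step][rank] = sum(grads[step]) / world_size

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return max_gap, averaged


if __name__ == "__main__":
    gap, avg = run_workers()
    print("largest step gap between ranks:", gap)
    print("averaged gradient per step and rank:", avg)
```

In this sketch the fast rank can be at most one step ahead of the slow one, because it blocks at the barrier until the slow rank has contributed its gradient. If the same reasoning carries over to the per-step gradient synchronization in DDP, a gap of a whole epoch between nodes would not be expected, so it may be worth checking how the progress of each node is being logged.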

Cooper_Lee · October 14, 2019, 5:57am

Thanks for your reply. I am using the latter, DDP. Based on your answer, I will need some time to look into the out-of-sync problem on our distributed GPU system.