Training on a distributed GPU system is out of sync

We train our model with PyTorch on a distributed multi-node GPU system, with four V100 GPUs per node. We follow the PyTorch demo for distributed training, so each node holds a replica of the model. However, there is a large time gap between the model copies on different nodes; they appear to be out of sync. Has anyone else run into this problem? Any advice on this issue would be appreciated.
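
For context, our script roughly follows the standard DistributedDataParallel pattern from the tutorial. The sketch below is only illustrative, not our actual code: the model, batch sizes, and loop are placeholders, and it assumes one process per GPU launched with torchrun.

```python
# Minimal sketch of the per-process setup we follow (PyTorch DDP tutorial style).
# The model, data, and sizes are placeholders, not our real training code.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    model = DDP(nn.Linear(1024, 1024).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        inputs = torch.randn(32, 1024, device=device)    # placeholder batch
        targets = torch.randn(32, 1024, device=device)
        loss = nn.functional.mse_loss(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()   # DDP all-reduces gradients across ranks here
        optimizer.step()
        if step % 10 == 0:
            # printing rank and step is one way to see which node lags behind
            print(f"rank {rank} step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

We launch something like `torchrun --nnodes=<N> --nproc_per_node=4 train.py` on every node; the exact launch command here is just an example.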