From the PyTorch documentation, I know that when one uses DistributedDataParallel, gradients on multiple GPUs are averaged (all-reduced), and each model’s parameters are updated based on those gradients, so there is no need to sync the model’s parameters. But we know that floating-point tensor addition can incur precision errors. As training goes on (say, over millions of minibatches), will the precision error affect the consistency of the model’s parameters across multiple GPUs? Thank you!
I don’t think any extra step is taken on our side to avoid this. So you need to make sure that your values are well-formed and can be accumulated without problems (which should be the case if your inputs are correctly preprocessed and you use a regular initialization method for your weights).
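To make the concern concrete, here is a minimal pure-Python sketch (no PyTorch, no GPUs). It shows two things: float addition depends on summation order, and replicas that all apply the same reduced gradient stay bitwise identical even though that gradient carries rounding error. The function names `sum_in_order` and `simulate_replicas` are hypothetical, and the simulation assumes every replica receives exactly the same averaged value from the all-reduce. That is an assumption about the communication backend, not something stated above.

```python
import random


def sum_in_order(values, order):
    """Sum `values` in the index order given by `order`."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


def simulate_replicas(num_replicas=4, steps=1000, lr=0.1, seed=0):
    """Toy model of data-parallel SGD on one scalar weight per replica.

    ASSUMPTION: the averaged gradient is computed once per step and the
    same bitwise value is handed to every replica.
    """
    rng = random.Random(seed)
    params = [1.0] * num_replicas  # all replicas start from the same weight
    for _ in range(steps):
        # Each replica produces its own local gradient.
        grads = [rng.uniform(-1.0, 1.0) for _ in range(num_replicas)]
        # A single reduction; its rounding error is shared by all replicas.
        avg = sum_in_order(grads, range(num_replicas)) / num_replicas
        # Every replica applies the identical update.
        params = [p - lr * avg for p in params]
    return params
```

Calling `sum_in_order([0.1, 0.2, 0.3], [0, 1, 2])` and `sum_in_order([0.1, 0.2, 0.3], [2, 1, 0])` gives two different floats, while `simulate_replicas()` returns a list whose entries are all equal. Under the stated assumption, the rounding error makes the weights slightly inexact but does not by itself make the replicas disagree. If replicas could receive differently rounded sums, drift would be possible. In a real job, one way to check would be to compare each rank's parameters against rank 0's, for example after a `torch.distributed.broadcast`.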