Hello, I hope you are doing well.
I am finalizing my experiments with PyTorch. Once my paper is finished, I hope I can share it here.
Anyway, is there any detailed documentation about DataParallel (DP) and DistributedDataParallel (DDP)?
In my experiments, DP and DDP show a large accuracy difference with the same dataset, network, learning rate, and loss function. I would like to include these results in my paper, but my professor asks for a detailed explanation of why this happens. My dataset is a very unusual image dataset, not common objects like ImageNet or Cityscapes, so my results may differ from those in typical computer science papers. For this reason, I looked around and read some articles:
https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html
https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/
However, I am still confused about how these two multi-GPU training strategies differ.
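To make the comparison concrete, here is a minimal sketch of the two setups I mean. The tiny model, random data, and batch sizes are only placeholders standing in for my actual network and dataset, not my real training code:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder model and data standing in for my real network and image dataset.
def make_model():
    return nn.Sequential(
        nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
    )

dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 2, (256,)))

# In my real experiments, Run A and Run B are separate scripts;
# they are shown together here only for comparison.

# --- Run A: DataParallel (single process, each batch is split across visible GPUs) ---
dp_model = nn.DataParallel(make_model().cuda())
dp_loader = DataLoader(dataset, batch_size=64, shuffle=True)  # 64 = global batch

# --- Run B: DistributedDataParallel (one process per GPU, launched with torchrun) ---
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)
ddp_model = DDP(make_model().cuda(), device_ids=[local_rank])
sampler = DistributedSampler(dataset)
ddp_loader = DataLoader(dataset, batch_size=16, sampler=sampler)  # 16 = per-process batch
```

My specific questions are: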
- What does “reduce” mean here? Does it refer to the weight update or to the loss reduction?
- What is the major difference between DP and DDP in their weight update strategy? I think this is the important part.
- Does DDP synchronize batch normalization (BN) statistics by itself, or does it still need synchronized BN (SyncBatchNorm)? See the sketch after this list for what I mean.
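For the last question, my current understanding is that plain DDP keeps each process's BN statistics local, and that synchronizing them needs an explicit conversion before wrapping the model, roughly like below (`make_model()` and `local_rank` are the placeholders from the sketch above). Please correct me if DDP already handles this:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Convert every BatchNorm layer to SyncBatchNorm *before* wrapping with DDP,
# so batch statistics are computed over the combined batch of all processes
# instead of each GPU's local shard.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(make_model().cuda())
sync_model = DDP(sync_model, device_ids=[local_rank])
```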
Thank you for reading my question.