Hi, I’m training a point cloud analysis model with DistributedDataParallel (DDP). I followed all the rules for setting up DDP training. Everything is normal at the beginning, but toward the end of the first epoch the logs from the different GPUs go out of sync: one rank has already started epoch 1 while the others are still finishing epoch 0. For example:

come to epoch: 0, step: 429, loss: 0.046092418071222385
come to epoch: 0, step: 429, loss: 0.046092418071222385
come to epoch: 0, step: 429, loss: 0.046092418071222385
come to epoch: 0, step: 429, loss: 0.046092418071222385
come to epoch: 0, step: 430, loss: 0.04677124587302715
come to epoch: 0, step: 430, loss: 0.04677124587302715
come to epoch: 0, step: 430, loss: 0.04677124587302715
come to epoch: 1, step: 0, loss: 0.04677124587302715
come to epoch: 0, step: 431, loss: 0.03822317679159113
come to epoch: 0, step: 431, loss: 0.03822317679159113
come to epoch: 1, step: 1, loss: 0.03822317679159113
come to epoch: 0, step: 431, loss: 0.03822317679159113
come to epoch: 0, step: 432, loss: 0.0431362825095357
come to epoch: 0, step: 432, loss: 0.0431362825095357
come to epoch: 0, step: 432, loss: 0.0431362825095357
come to epoch: 1, step: 2, loss: 0.0431362825095357
come to epoch: 0, step: 433, loss: 0.04170320917830233
come to epoch: 1, step: 3, loss: 0.04170320917830233
come to epoch: 0, step: 433, loss: 0.04170320917830233
come to epoch: 0, step: 433, loss: 0.04170320917830233
come to epoch: 0, step: 434, loss: 0.042295407038902666
come to epoch: 0, step: 434, loss: 0.042295407038902666
come to epoch: 1, step: 4, loss: 0.042295407038902666
come to epoch: 0, step: 434, loss: 0.042295407038902666
come to epoch: 0, step: 435, loss: 0.040262431528578634
come to epoch: 1, step: 5, loss: 0.040262431528578634
come to epoch: 0, step: 435, loss: 0.040262431528578634
come to epoch: 0, step: 435, loss: 0.040262431528578634
come to epoch: 0, step: 436, loss: 0.04188207677967013
come to epoch: 0, step: 436, loss: 0.04188207677967013
come to epoch: 0, step: 436, loss: 0.04188207677967013
come to epoch: 1, step: 6, loss: 0.04188207677967013

Can anyone help?
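
One thing I suspect: if each rank ends up with a different number of batches, the rank with the shortest shard would wrap around into epoch 1 while the others are still in epoch 0, which looks like what my log shows. A quick sketch of the arithmetic (`steps_per_rank` is a hypothetical helper I wrote only for illustration; it is not part of my training code, and the numbers are made up):

```python
import math

def steps_per_rank(dataset_len, world_size, batch_size, padded=True):
    """Number of steps each rank runs per epoch.

    padded=True mimics DistributedSampler's default behavior of padding the
    dataset so every rank draws the same number of samples; padded=False
    mimics a hand-rolled split that leaves the remainder unevenly distributed.
    """
    if padded:
        per_rank = math.ceil(dataset_len / world_size)
        return [math.ceil(per_rank / batch_size)] * world_size
    base, extra = divmod(dataset_len, world_size)
    shard_sizes = [base + (1 if r < extra else 0) for r in range(world_size)]
    return [math.ceil(s / batch_size) for s in shard_sizes]

print(steps_per_rank(1025, 4, 8, padded=True))   # [33, 33, 33, 33] - ranks stay in lockstep
print(steps_per_rank(1025, 4, 8, padded=False))  # [33, 32, 32, 32] - one rank finishes early
```

So if my sharding ended up uneven like the `padded=False` case, one GPU would hit the end of its dataloader a few steps before the others and start the next epoch. Does that sound like the likely cause, or is something else going on?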

Thanks.