DataParallel multi-GPU training degrades performance and produces very noisy learning curves and losses

Hi, I am currently experiencing the problem that when I go from single-GPU to multi-GPU training, performance degrades severely. I've also noticed that the learning curves and losses are much noisier. The only difference between the two setups in the script is the value of the environment variable CUDA_VISIBLE_DEVICES. I am using torch.nn.DataParallel to parallelize the model as in https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html.
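For reference, here is a minimal sketch of the kind of setup the tutorial describes; the model, sizes, and device list below are placeholders, not my actual script:

```python
import torch
import torch.nn as nn

# The only difference between the two runs is the set of visible GPUs, e.g.
#   single-GPU run: CUDA_VISIBLE_DEVICES="0"
#   multi-GPU run:  CUDA_VISIBLE_DEVICES="0,1,2,3"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(128, 10)  # placeholder model
if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch along dim 0 across the visible GPUs,
    # runs a model replica on each, and gathers the outputs on the default device.
    model = nn.DataParallel(model)
model.to(device)
```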

What could be the reason for this puzzling difference between training with 1 GPU and more than 1 GPU? And what can I do to isolate the source of the problem?

Thanks in advance

Any update on this issue? Have you solved the problem? I'm currently facing the same issue.