Should there be any difference in results when running DataParallel?

I have run a few models using DataParallel, just by passing the model to nn.DataParallel(), and strange things sometimes happen. What I most commonly see is that my graphs (matplotlib, etc.) look different, like WAY different, which makes me think something is being computed differently. Other times I get warnings or errors. I see most of the problems with LSTMs, and in searching I found a few reported bugs involving LSTMs and DataParallel. Is it, or should it be, safe to run things using DataParallel, or is it unstable?

I am running 1.5.0a0+8f84ded. I know there are newer versions, but I am running the NVIDIA NGC image and they haven't released a new one. I did update PyTorch via pip just to try it out and still had some of the issues, so I went back to 1.5.0a0+8f84ded.

Also, many times models don't seem all that much faster with DataParallel, and I am not sure if there are any tips to make things faster. I do use pin_memory and tune my DataLoader num_workers to find what is fastest.
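For context, the setup is roughly this (MyLSTMModel and train_dataset are just placeholders for my actual model and data):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = torch.device("cuda")

model = MyLSTMModel()           # placeholder for the actual model
model = nn.DataParallel(model)  # wrap for multi-GPU training
model.to(device)

loader = DataLoader(
    train_dataset,              # placeholder dataset
    batch_size=64,
    shuffle=True,
    num_workers=4,              # tuned per machine
    pin_memory=True,
)

for inputs, targets in loader:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    outputs = model(inputs)     # the batch is scattered across GPUs internally
    # ... loss, backward, optimizer step ...
```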

The training might generally differ, since the gradients will be reduced from all devices to the default device, which is roughly equivalent to training with a larger batch size on a single device.
Also, if you are using batchnorm layers, their running estimates won’t be synchronized in nn.DataParallel, which might also make a difference.
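As a quick illustration of how nn.DataParallel splits the batch (a minimal sketch, assuming at least 2 visible GPUs; the module here exists only to print the chunk it receives):

```python
import torch
import torch.nn as nn

class PrintChunk(nn.Module):
    def forward(self, x):
        # Each replica only sees its chunk of the input batch.
        print(x.device, x.shape)
        return x

model = nn.DataParallel(PrintChunk().cuda())
x = torch.randn(64, 8).cuda()
model(x)
# With 2 GPUs this prints something like:
#   cuda:0 torch.Size([32, 8])
#   cuda:1 torch.Size([32, 8])
```

Each replica therefore computes its batchnorm statistics from its own chunk only, and the gradients from all chunks are reduced back to the default device.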

We recommend using DistributedDataParallel with a single process per GPU for the best performance. DDP also supports SyncBatchNorm, which might be beneficial.
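A minimal sketch of that setup, assuming the script is launched with torch.distributed.launch (e.g. `python -m torch.distributed.launch --nproc_per_node=NUM_GPUS train.py`), which starts one process per GPU and passes `--local_rank`; MyLSTMModel is a placeholder for your actual model:

```python
import argparse
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torch.distributed.launch passes the local rank of this process.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    dist.init_process_group(backend="nccl")  # uses env:// rendezvous set up by the launcher
    torch.cuda.set_device(args.local_rank)

    model = MyLSTMModel().cuda(args.local_rank)  # placeholder for the actual model
    # Convert any BatchNorm layers to SyncBatchNorm so running stats are synchronized.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)

    # ... build a DataLoader with a DistributedSampler and train as usual ...

if __name__ == "__main__":
    main()
```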
