Model no longer trainable with multiple GPUs

I have a model that runs on a single GPU, and I am trying to speed it up with multiple GPUs via DataParallel. The model's training loss can no longer reach the level it did in the single-GPU case.
To analyze this, I duplicated the model: the first copy runs on a single GPU and the second on multiple GPUs, and both loaded the same well-trained weights. The outputs produced by the two models were very different, as measured by the L2 norm. This is unexpected because they load the same weights; the only obvious difference is the batchnorm layers. In another experiment, I set the models to .eval() mode, and this time the L2 difference went to 0!
One interesting finding is that changing only the track_running_stats setting of batchnorm does not help; the models have to be switched to .eval() mode to eliminate the difference. This suggests the root cause might not be what I thought, but I don't know how to confirm it.
.train() + track_running_stats=True  --> has difference
.train() + track_running_stats=False --> has difference
.eval()  + track_running_stats=True  --> no difference
.eval()  + track_running_stats=False --> has difference
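The pattern above can be reproduced without multiple GPUs by feeding a batchnorm layer the full batch versus its two halves, which mimics how DataParallel computes statistics per replica (a minimal sketch with a toy BatchNorm1d layer, not the actual model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)

# .train(): each "replica" normalizes with its own shard's batch statistics,
# which is what DataParallel does per GPU, so the outputs differ
bn.train()
full = bn(x)
halves = torch.cat([bn(x[:4]), bn(x[4:])])
diff_train = (full - halves).norm().item()

# .eval(): both paths normalize with the same running statistics,
# and each row's output depends only on that row, so the outputs match
bn.eval()
full = bn(x)
halves = torch.cat([bn(x[:4]), bn(x[4:])])
diff_eval = (full - halves).norm().item()

print(diff_train, diff_eval)
```

This matches the table: in train mode the per-shard statistics drive the difference, while in eval mode the shared running statistics make both paths identical.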

I’d like to ask:

  1. Apart from batchnorm, what else can be affected by .eval()? Is there a method to check this?
  2. Is synchronized batchnorm necessary for multi-GPU training?
  3. The reduced per-GPU batch size shouldn't have such a huge impact, because the model was still trainable on a single GPU when I deliberately halved the batch size. Does this observation make sense?
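For question 1, the common train/eval-sensitive layers are the dropout and batchnorm families, and they can be enumerated by walking the module tree (a sketch with a hypothetical toy model; custom modules that read self.training directly won't be caught by a type check):

```python
import torch.nn as nn

def modules_with_train_eval_behavior(model):
    """List submodules whose forward pass depends on .train()/.eval().

    Checking for the Dropout and BatchNorm families covers the common
    built-in cases; custom modules that branch on self.training must
    be inspected by hand.
    """
    affected = []
    for name, m in model.named_modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d,
                          nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            affected.append((name, type(m).__name__))
    return affected

# toy model for illustration only
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8),
                      nn.ReLU(), nn.Dropout(0.5))
print(modules_with_train_eval_behavior(model))
```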

pytorch 0.4.0
ubuntu 16
python 3.5.2

Sorry for the lengthy and odd questions; thanks for your patience.

Based on your experiments, it looks like the batch statistics differ significantly between the single-GPU run and the data-parallel one.
Using .train() or track_running_stats=False means each replica normalizes with its current batch estimates, which can be skewed for smaller per-GPU batch sizes.
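How noisy those batch estimates get can be seen with a quick stdlib-only simulation: the spread of the per-batch mean grows roughly with 1/sqrt(batch size), so splitting a batch of 64 into shards of 8 makes each replica's statistic noticeably noisier (numbers here are illustrative, not from your model):

```python
import random
import statistics

random.seed(0)
# stand-in for one activation channel's values across the dataset
population = [random.gauss(0.0, 1.0) for _ in range(100_000)]

def batch_mean_spread(batch_size, trials=2000):
    # std-dev of the per-batch mean: how noisy BN's batch statistic is
    means = []
    for _ in range(trials):
        batch = random.sample(population, batch_size)
        means.append(sum(batch) / batch_size)
    return statistics.stdev(means)

s_full = batch_mean_spread(64)   # whole batch on one GPU
s_shard = batch_mean_spread(8)   # per-GPU shard after splitting across 8 GPUs
print(s_full, s_shard)
```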

You could try SyncBatchNorm with DistributedDataParallel.
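A sketch of that suggestion: note that SyncBatchNorm ships with PyTorch >= 1.1, so it would mean upgrading from 0.4.0, and the DistributedDataParallel wrapping assumes one process per GPU with torch.distributed already initialized:

```python
import torch.nn as nn

# toy model standing in for yours; convert_sync_batchnorm swaps every
# BatchNorm*d layer for a SyncBatchNorm that reduces statistics across GPUs
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(type(model[1]).__name__)

# Inside the distributed setup (one process per GPU, after
# torch.distributed.init_process_group), you would then wrap it as:
#   model = nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[rank])
```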