With nn.DataParallel we can move a model from a single GPU to multiple GPUs, but I am puzzled about how batch norm behaves in the multi-GPU case. For most operations the samples are independent of each other, so we can simply split the batch across the GPUs and fuse the gradients from each GPU in the backward pass. Batch norm is different: it treats the batch as a whole tensor, so running BN separately on each GPU may behave differently from running BN on the full batch on a single GPU.
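To make the concern concrete, here is a small sketch (plain NumPy, not DataParallel itself) comparing the batch statistics computed on a whole batch with those computed on per-GPU splits; the split sizes and data are made up for illustration:

```python
import numpy as np

# Illustrative sketch: compare batch-norm statistics computed on the
# whole batch vs. on per-GPU splits of that batch.
rng = np.random.default_rng(0)
batch = rng.normal(loc=2.0, scale=3.0, size=(8, 4))  # 8 samples, 4 features

# Whole-batch statistics, as BN on a single GPU would use.
full_mean = batch.mean(axis=0)
full_var = batch.var(axis=0)

# Per-split statistics, as two independent BN replicas would use
# if the batch were split across 2 GPUs.
split_a, split_b = batch[:4], batch[4:]
mean_a = split_a.mean(axis=0)
mean_b = split_b.mean(axis=0)

# The per-split means generally differ from the whole-batch mean, so
# normalizing each split with its own statistics gives different outputs
# than normalizing the full batch at once.
print(np.allclose(full_mean, mean_a))
print(np.allclose(full_mean, mean_b))
```

With random data like this, both checks print `False`, which is exactly the discrepancy the question is about.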
So my question is: what does `nn.DataParallel` actually do with BN operations? Does it run BN independently on each replica, or does it perform some synchronization of the batch statistics?