Implementing Batchnorm in Pytorch. Problem with updating self.running_mean and self.running_var

It depends a bit on which implementation you mean when you say the “native” implementation.
E.g. since you are using the GPU, cudnn would be used, which provides fast algorithms for the batchnorm operations. To disable it, you could set torch.backends.cudnn.enabled = False and compare the speed against your custom implementation again to see if you land inside the desired 5-10% performance drop window.
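A rough timing sketch of that comparison might look like this; the layer shape, input size, and iteration count are arbitrary placeholders:

```python
import time
import torch
import torch.nn as nn

# pick the GPU when available; cudnn only matters on CUDA
device = "cuda" if torch.cuda.is_available() else "cpu"
bn = nn.BatchNorm2d(64).to(device)
x = torch.randn(32, 64, 28, 28, device=device)

def bench(iters=50):
    if device == "cuda":
        torch.cuda.synchronize()  # make GPU timing accurate
    t0 = time.perf_counter()
    for _ in range(iters):
        bn(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - t0

t_cudnn = bench()
torch.backends.cudnn.enabled = False  # fall back to the native (non-cudnn) kernels
t_native = bench()
print(f"cudnn: {t_cudnn:.4f}s, native: {t_native:.4f}s")
```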

I tried the manual implementation of Batch Normalization. However, the training accuracy drops when I use it, and I am not sure what the cause of this may be.
I looked at the running mean and variance, and the tensors are all zeros.
Thanks for your help in advance.

I don’t know which manual implementation you are using, but my reference should update the running stats properly.
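For illustration, a minimal manual BatchNorm1d that updates its running stats could look like the sketch below (this is not the exact reference code mentioned above). The key points are registering running_mean/running_var as buffers and updating them in-place under no_grad:

```python
import torch
import torch.nn as nn

class ManualBatchNorm1d(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        # register_buffer puts the stats into the state_dict and moves them
        # with .to()/.cuda(), without making them trainable parameters
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        if self.training:
            mean = x.mean(dim=0)
            var = x.var(dim=0, unbiased=False)  # biased variance for normalization
            n = x.size(0)
            with torch.no_grad():
                # in-place buffer updates; note PyTorch stores the *unbiased*
                # variance in running_var
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * var * n / (n - 1))
        else:
            mean, var = self.running_mean, self.running_var
        return (x - mean) / torch.sqrt(var + self.eps) * self.weight + self.bias
```

If the running stats stay all zeros, a common cause is storing them as plain tensor attributes (or re-creating them in forward) instead of updating the registered buffers.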

In my case, I use multiple nn.BatchNorm1d modules to construct several domain-specific BNs, and I design the dataloader to route data from different domains to different GPUs, so that each BN gets data from one GPU, e.g. ‘gpu1: bn1, gpu2: bn2, …, gpu_n: bn_n’. After training, bn.weight and bn.bias seem to update properly, but only the running_mean (and running_var) of the first BN (on gpu:0) were updated; the mean and var of bn_2, …, bn_n were not updated at all. How can I solve this problem? Any advice? Thanks

I think you could use SyncBatchNorm to synchronize the stats.
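For reference, converting an existing model is a one-liner; note the synchronized stats only take effect when the model runs under DistributedDataParallel:

```python
import torch.nn as nn

# toy model with a regular BatchNorm layer; the layer sizes are placeholders
model = nn.Sequential(nn.Linear(10, 20), nn.BatchNorm1d(20))

# replaces every BatchNorm*d layer with SyncBatchNorm in-place in the module tree
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(type(model[1]).__name__)  # SyncBatchNorm
```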

Thank you for your quick reply. The loss and data I employ are not friendly for DDP training, so I use DP for multi-GPU training. Thus SyncBatchNorm does not seem to work for me. As you say,

Are there any functions in DP to synchronize buffers across all devices?

I think you are right: nn.DataParallel is not compatible with SyncBatchNorm, and I don’t know how you could synchronize the stats, as the models are copied from the default device in each iteration. Maybe changing the momentum might help.
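For context, the momentum argument controls how far the running stats move toward the current batch statistics on each update, so a small example makes the update rule concrete:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3, momentum=0.5)  # default momentum is 0.1
x = torch.randn(16, 3)
bn(x)  # one training-mode forward pass updates the buffers

# update rule: running_mean = (1 - momentum) * old_mean + momentum * batch_mean
expected = 0.5 * torch.zeros(3) + 0.5 * x.mean(dim=0)
print(torch.allclose(bn.running_mean, expected, atol=1e-6))  # True
```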