How to handle running mean and variance of BatchNorm layer in distributed settings?


I'm wondering how I can update running_mean and running_var in a distributed master-worker cluster setting.

To be more specific, I'm using a parameter server setting: in each cluster, only one node (the parameter server, as presented in this paper) holds the model, while the other nodes serve as workers and compute gradients. The data are evenly split among the workers, i.e. each worker holds a subset of the dataset. During the model update stage, the parameter server gathers the gradients g_1, ..., g_n, one from each of the n workers, averages them, and uses the averaged gradient to update the "global model", i.e. w := w - lr * (1/n) * \sum_{i=1}^n g_i, where w is the global model, n is the number of worker nodes, and lr is the learning rate.
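For reference, here is a minimal sketch of that update step; the names (`global_model`, `worker_grads`, `ps_update`) are placeholders, not code from my actual project:

```python
import torch

def ps_update(global_model, worker_grads, lr):
    """Apply w := w - lr * (1/n) * sum_i g_i on the parameter server.

    worker_grads: list of n lists, one gradient tensor per model parameter,
                  gathered from the n workers.
    """
    n = len(worker_grads)
    with torch.no_grad():
        for idx, param in enumerate(global_model.parameters()):
            # Average this parameter's gradient across the n workers.
            avg_grad = sum(grads[idx] for grads in worker_grads) / n
            param -= lr * avg_grad
```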

I got this working, and it runs fine for models without BatchNorm layers, e.g. LeNet. But when it comes to a more complex network such as ResNet, when I try to evaluate the model held on the parameter server, it performs as if the model had not been trained at all. In my version, I only update BatchNorm.weight and BatchNorm.bias, which can be accessed through model.parameters() on the parameter server, while leaving running_mean and running_var unchanged on the parameter server.
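As a quick check (again, not my actual code), the running statistics are registered as buffers rather than parameters, so model.parameters() never yields them:

```python
import torchvision

model = torchvision.models.resnet18()

param_names = {name for name, _ in model.named_parameters()}
buffer_names = [name for name, _ in model.named_buffers()
                if name.endswith("running_mean") or name.endswith("running_var")]

print(any("running_mean" in n for n in param_names))  # False: not in parameters()
print(buffer_names[:2])  # e.g. ['bn1.running_mean', 'bn1.running_var']
```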

I think that's why I can't get the correct evaluation performance from the model held by the parameter server. So my question is: under the setting described above, how can I handle running_mean and running_var on the parameter server correctly? I noticed that this topic ([resolved] Synchronize BatchNorm mean and variance across gpu) mentioned something related, but it's still unclear to me.
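To make the question concrete, the rough idea I have in mind is to also average the workers' running statistics on the parameter server, something like the sketch below (all names are placeholders), but I don't know whether simply averaging these buffers is the right thing to do:

```python
def ps_sync_bn_buffers(global_model, worker_states):
    """Average running_mean / running_var received from each worker.

    worker_states: list of n state_dicts, one per worker model.
    """
    n = len(worker_states)
    global_state = global_model.state_dict()
    for name in global_state:
        # Only touch the BatchNorm running statistics; leave everything else as-is.
        if name.endswith("running_mean") or name.endswith("running_var"):
            global_state[name] = sum(s[name] for s in worker_states) / n
    global_model.load_state_dict(global_state)
```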

Thanks a lot.


Hi, you may be interested in Synchronized Multi-GPU Batch Normalization
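For later readers: newer PyTorch releases also ship a built-in torch.nn.SyncBatchNorm, which synchronizes batch statistics across processes when the model is trained with DistributedDataParallel. A minimal sketch, assuming a DDP setup (with torch.distributed initialized) rather than a parameter server:

```python
import torch
import torchvision

model = torchvision.models.resnet18()
# Replace every BatchNorm layer with its synchronized counterpart.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
# Then wrap the model with DistributedDataParallel inside an initialized
# process group, e.g.:
# model = torch.nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[rank])
```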