How to handle running mean and variance of BatchNorm layer in distributed settings?

Hi,

I'm wondering how I can update running_mean and running_var in a distributed master-worker cluster setting.

To be more specific, I'm using a parameter-server setup in which a single node in the cluster (the parameter server, as described in this paper: https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf) holds the model, while the other nodes serve as workers and compute gradients. The data are evenly split among the workers, i.e. each worker holds a subset of the dataset. During the model update stage, the parameter server gathers the gradients g_1, ..., g_n from the n workers, averages them, and uses the average to update the "global model": w := w - lr * (1/n) \sum_{i=1}^n g_i, where w is the global model and lr is the learning rate.
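For reference, here is a minimal sketch of that update rule on the parameter server side (ps_update, global_model, worker_grads, and lr are hypothetical names, not my actual code):

```python
import torch

def ps_update(global_model, worker_grads, lr):
    # worker_grads: list of n lists, one gradient tensor per parameter per worker
    n = len(worker_grads)
    with torch.no_grad():
        for i, p in enumerate(global_model.parameters()):
            avg_grad = sum(g[i] for g in worker_grads) / n  # (1/n) * sum_i g_i
            p -= lr * avg_grad                              # w := w - lr * avg_grad
```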

This works fine for models without BatchNorm layers, e.g. LeNet. But for more complex networks such as ResNet, when I evaluate the model held on the parameter server, it performs as if it had never been trained. In my version, I only update the BatchNorm weight and bias, which are accessible through model.parameters() on the parameter server, while leaving running_mean and running_var unchanged there.
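For example, on a torchvision ResNet-18 you can see that the running stats never show up in model.parameters(); they are registered as buffers instead (this snippet is just an illustration of what I mean):

```python
import torchvision

model = torchvision.models.resnet18()
param_names = {name for name, _ in model.named_parameters()}
buffer_names = {name for name, _ in model.named_buffers()}

print("bn1.weight" in param_names)         # True  - returned by model.parameters()
print("bn1.running_mean" in param_names)   # False - not a parameter
print("bn1.running_mean" in buffer_names)  # True  - it is a buffer
```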

I think that's why the model held by the parameter server doesn't evaluate correctly. So my question is: under the setting described above, how should I handle running_mean and running_var on the parameter server? I noticed that this topic ([resolved] Synchronize BatchNorm mean and variance across gpu) mentions something related, but it's still unclear to me.

Thanks a lot.


Hi, you may be interested in Synchronized Multi-GPU Batch Normalization http://hangzh.com/PyTorch-Encoding/syncbn.html
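If you don't want to pull in that package, a rough workaround on the parameter-server side is to collect the BatchNorm buffers from the workers and average them into the global model. This is only a sketch with hypothetical names (global_model, worker_models), and note that naively averaging running_var ignores the spread between the per-worker means, so it is an approximation rather than an exact synchronization:

```python
import torch

def average_bn_buffers(global_model, worker_models):
    # Average the BatchNorm running statistics collected from the workers
    # into the global model held on the parameter server.
    with torch.no_grad():
        for name, buf in global_model.named_buffers():
            if "running_mean" in name or "running_var" in name:
                worker_bufs = [dict(m.named_buffers())[name] for m in worker_models]
                buf.copy_(torch.stack(worker_bufs).mean(dim=0))
```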