Hi,
I'm wondering how I can update `running_mean` and `running_var` in a distributed master-worker cluster setting.
To be more specific, I'm using a parameter-server setup: in each cluster, only one node (the parameter server, as presented in this paper: https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf) holds the model, while the other nodes serve as workers and compute gradients. The data are evenly split among the workers, i.e. each worker holds a subset of the dataset. During the model-update stage, the parameter server gathers the gradients `g_1, ..., g_n` from the `n` workers, averages them, and uses the averaged gradient to update the "global model", i.e. `w := w - lr * 1/n \sum_{i=1}^n g_i`, where `w` is the global model, `n` is the number of workers, and `lr` is the learning rate.
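The server-side update described above can be sketched roughly like this (a minimal sketch; `server_update` and the shape of `worker_grads` are my own assumptions, not part of any library API):

```python
import torch
import torch.nn as nn

def server_update(model: nn.Module, worker_grads: list, lr: float):
    """Apply w := w - lr * (1/n) * sum_i g_i on the parameter server.

    worker_grads is assumed to be a list of n gradient lists, one per
    worker, each aligned with model.parameters() (hypothetical layout).
    """
    n = len(worker_grads)
    with torch.no_grad():
        for j, p in enumerate(model.parameters()):
            avg_g = sum(grads[j] for grads in worker_grads) / n
            p -= lr * avg_g  # in-place SGD step with the averaged gradient
```

Note that this loop only ever touches `model.parameters()`, which is exactly why the BatchNorm running statistics are never transferred.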
I got this working, and it runs fine for models without `BatchNorm` layers, e.g. LeNet. But with a more complex network such as ResNet, when I tried to evaluate the model held on the parameter server, it performed as if it had never been trained. In my version I only updated `BatchNorm.weight` and `BatchNorm.bias`, which can be accessed through `model.parameters()` on the parameter server, while leaving `running_mean` and `running_var` there unchanged.
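For reference, the running statistics never appear in `model.parameters()` at all: PyTorch registers them as buffers, so they are only visible through `model.buffers()` / `model.named_buffers()` / `model.state_dict()`. A quick check:

```python
import torch.nn as nn

bn = nn.BatchNorm2d(4)

# weight and bias are learnable parameters, seen by optimizers...
param_names = {name for name, _ in bn.named_parameters()}
assert param_names == {"weight", "bias"}

# ...while running_mean and running_var are buffers, invisible to any
# gradient exchange that iterates over model.parameters().
buffer_names = {name for name, _ in bn.named_buffers()}
# contains 'running_mean' and 'running_var' (and 'num_batches_tracked'
# in recent PyTorch versions)
print(buffer_names)
```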
I think that's why I can't get correct evaluation results from the model held by the parameter server. So my question is: under the setting described above, how can I handle `running_mean` and `running_var` on the parameter server correctly? I noticed that this topic ([resolved] Synchronize BatchNorm mean and variance across gpu) mentions something related, but it's still unclear to me.
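To make the question concrete: the kind of server-side step I imagine would be something like the sketch below, where workers ship their BatchNorm buffers (e.g. via `state_dict()`) alongside gradients and the server averages them. I'm not sure whether plain averaging of `running_var` is statistically the right thing to do, which is part of what I'm asking; `average_bn_buffers` and `worker_states` are hypothetical names:

```python
import torch
import torch.nn as nn

def average_bn_buffers(server_model: nn.Module, worker_states: list):
    """Overwrite the server's BatchNorm running stats with the mean of
    the corresponding buffers reported by each worker (hypothetical
    helper; worker_states is a list of state_dict()-style mappings)."""
    n = len(worker_states)
    with torch.no_grad():
        for name, buf in server_model.named_buffers():
            if "running_mean" in name or "running_var" in name:
                buf.copy_(sum(ws[name] for ws in worker_states) / n)
```

Is something along these lines correct, or does the running variance need a different combination rule?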
Thanks a lot.