How to estimate batch normalization parameters for a separate test set, or for the recently published paper suggesting that weight averaging leads to wider optima?


For models using batch normalization, most example code reuses the mean and variance estimated during training when evaluating the model.

However, suppose we now have a different test set whose statistics differ from those of the training set. Or assume one wants to implement something like the paper Averaging Weights Leads to Wider Optima and Better Generalization (see Section 3.3). There, we need to estimate the mean and variance again by passing every sample from our dataset through the model once.

My question is whether there is already built-in functionality for this. If not, maybe we can start a discussion on how this could be achieved.

The most “primitive” idea would be to first reset all BN-related statistics, freeze the neural network weights except the BN parameters, then do one “training” epoch. The question now is whether this has to be implemented from scratch or whether there is built-in functionality that can help with this goal. One person suggested that setting requires_grad=False for some variables (the input?) could be a promising start.

Question: Are the running mean and variance updated in the forward or the backward pass?

Because if it’s updated in the forward pass, the algorithm becomes much simpler, even trivial. Then, we only have to iterate over the data and do a forward pass for each sample while in train mode. No need to do complicated stuff for freezing weights then.

Does anyone know this? It’s not obvious from the code, since everything is hidden inside C functions spread across the code base.

There are buffers capturing the mean and variance. If you know what the right values are, you could use .copy_ on the buffers to overwrite the training-estimated ones.
I think they are updated in the forward (they are buffers, not parameters, ha), but I recommend not to take my word for it but to check yourself.
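For illustration, here is a minimal sketch of the `.copy_` idea. The layer size and the values in `new_mean` / `new_var` are made up; in practice they would be statistics you computed yourself on the new dataset.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3)

# Hypothetical statistics for the new dataset, one value per channel.
new_mean = torch.tensor([0.5, -1.0, 2.0])
new_var = torch.tensor([1.5, 0.25, 4.0])

# Buffers are not handled by the optimizer, so an in-place copy is enough.
with torch.no_grad():
    bn.running_mean.copy_(new_mean)
    bn.running_var.copy_(new_var)

bn.eval()  # eval mode normalizes with the buffers instead of batch statistics
x = new_mean.repeat(4, 1)  # every row equals the new mean
y = bn(x)  # a fresh layer has weight=1 and bias=0, so y is (about) zero
```

Because `copy_` writes into the existing buffers, the values stay on whatever device the module already lives on.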

Best regards


I see.

I think I can answer at least this question now once and for all for the community: Yes, the running statistics are updated in the forward pass. This holds even if requires_grad=False or volatile=True (which is a good thing, since you can save memory then).

Here is the proof:

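>>> import torch
>>> import torch.nn as nn
>>> from torch.autograd import Variable
>>> net = nn.Sequential()
>>> net.add_module("conv", nn.Conv1d(1, 1, 1))
>>> net.add_module("bn", nn.BatchNorm1d(1))
>>> net[1].running_mean  # zeros right after construction
>>> net(Variable(torch.ones(1, 1, 1), volatile=True))  # forward pass only, no backward
>>> net[1].running_mean  # changed by the forward pass alone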
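For readers on PyTorch 0.4 or later, where `volatile` was replaced by `torch.no_grad()`, the same check could look roughly like the sketch below. The constant conv weights and the input shape are my own choices to make the result predictable. Recent versions also refuse a single value per channel in train mode, hence the larger input.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv1d(1, 1, 1), nn.BatchNorm1d(1))
# Fix the conv so that an all-ones input maps to the constant 2*1 + 1 = 3.
nn.init.constant_(net[0].weight, 2.0)
nn.init.constant_(net[0].bias, 1.0)
net.train()  # running statistics are only updated in train mode

before = net[1].running_mean.clone()
with torch.no_grad():  # no graph is built and backward() is never called
    net(torch.ones(4, 1, 8))
after = net[1].running_mean.clone()

# Default momentum is 0.1, so the new value is 0.9 * 0 + 0.1 * 3 = 0.3.
changed = not torch.equal(before, after)
```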

However, I still wonder if this is the correct way. As far as I can see, BN uses a fixed momentum, i.e. an exponential moving average, so the result depends on the order in which the batches are seen and later batches count more than earlier ones.
But I think for most datasets, and if you use proper randomization, this should be a good enough approximation.

I think if you wanted to do this properly, you’d really have to calculate the layer-wise statistics yourself using a fair average over your dataset and then set the buffers manually, which involves a lot of coding. For this, built-in functionality would be really great (like fit_batchnorm_params_to_dataset or similar).
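A rough sketch of what such a helper could look like, borrowing the name `fit_batchnorm_params_to_dataset` suggested above. It assumes the loader yields input tensors directly, and it relies on `momentum=None`, which the PyTorch BatchNorm documentation describes as a cumulative moving average (simple average). Note that this averages the per-batch means and variances with equal weight, which is not exactly the variance of the whole dataset. Newer PyTorch releases also ship `torch.optim.swa_utils.update_bn(loader, model)`, which follows the same recipe for SWA.

```python
import torch
import torch.nn as nn
from torch.nn.modules.batchnorm import _BatchNorm


def fit_batchnorm_params_to_dataset(model, loader):
    """Re-estimate BN running statistics as an equal-weight average over batches."""
    bn_layers = [m for m in model.modules() if isinstance(m, _BatchNorm)]
    saved_momentum = {bn: bn.momentum for bn in bn_layers}
    was_training = model.training

    for bn in bn_layers:
        bn.reset_running_stats()  # clears the buffers only; weight and bias stay
        bn.momentum = None        # None selects a cumulative (simple) average

    model.train()  # BN only updates its buffers in train mode
    with torch.no_grad():  # forward passes only, no gradients needed
        for batch in loader:
            model(batch)

    # Restore the original momentum values and the train/eval mode.
    for bn in bn_layers:
        bn.momentum = saved_momentum[bn]
    model.train(was_training)
```

With equal-sized batches the resulting `running_mean` no longer depends on the batch order, which addresses the momentum concern raised above.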

Interestingly, this approach fails spectacularly.

Here is my code where I want to adapt the batch norm statistics to my new test set:

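    for module in my_model.modules():
      if isinstance(module, nn.BatchNorm2d):
        ...  # reset the BN statistics (body not shown)
    for it, images in enumerate(test_set_loader):
      images = Variable(images.cuda(gpu_id), volatile=True)
      ...  # forward pass through my_model in train mode (call not shown)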

The result is a dramatic drop in accuracy to almost random guessing. It’s a semantic segmentation task. In contrast, when I use the batch norm statistics estimated for (and during) training, the results are much better.

Where is my logic error? Or is this exactly the effect I was worried about, namely that a running average with momentum is not equal to a true average?
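One thing that might be worth ruling out, although the snippet does not show which reset call was used, is whether the reset step also wipes the learned affine parameters. In current PyTorch, `reset_running_stats()` only clears the buffers, while `reset_parameters()` additionally re-initializes `weight` and `bias`. The small check below uses made-up values as stand-ins for learned parameters.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
with torch.no_grad():
    bn.weight.fill_(2.0)        # stand-in for a learned scale (gamma)
    bn.bias.fill_(0.5)          # stand-in for a learned shift (beta)
    bn.running_mean.fill_(1.0)  # stand-in for training-time statistics

bn.reset_running_stats()  # clears running_mean / running_var only
weight_after_stats_reset = bn.weight.detach().clone()

bn.reset_parameters()     # also re-initializes weight and bias
weight_after_full_reset = bn.weight.detach().clone()
```

If the learned scale and shift are lost, a large accuracy drop would not be surprising regardless of how the running average is computed. This is only one candidate explanation; a small number of test batches combined with the default momentum would be another.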

Hi @mario98, have you solved this problem? I also want to use the mean and sigma of the BN layers estimated from the target domain rather than the source domain, which is essentially AdaBN. But I don’t know how to do this in a simple way.
Does it work now?

Is there any update on this question? I also want to keep two sets of running_mean and variance, one for the source domain and one for the target domain. However, PyTorch keeps only one running_mean and variance per layer, shared by both domains.
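One possible workaround, sketched below, is to keep the second set outside the model: copy the BN buffers into a dictionary per domain and write the matching set back before evaluating. The helper names `snapshot_bn_stats` and `load_bn_stats` are hypothetical, and the random tensors merely stand in for source and target data.

```python
import torch
import torch.nn as nn
from torch.nn.modules.batchnorm import _BatchNorm


def snapshot_bn_stats(model):
    """Copy the running statistics of every BN layer, keyed by module name."""
    return {
        name: (m.running_mean.clone(), m.running_var.clone())
        for name, m in model.named_modules()
        if isinstance(m, _BatchNorm) and m.track_running_stats
    }


def load_bn_stats(model, stats):
    """Overwrite the BN buffers in place with a previously saved snapshot."""
    modules = dict(model.named_modules())
    with torch.no_grad():
        for name, (mean, var) in stats.items():
            modules[name].running_mean.copy_(mean)
            modules[name].running_var.copy_(var)


model = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))
model.train()
with torch.no_grad():
    model(torch.randn(16, 4))          # stand-in for source-domain data
source_stats = snapshot_bn_stats(model)

model[1].reset_running_stats()         # start the target estimate from scratch
with torch.no_grad():
    model(torch.randn(16, 4) * 5 + 2)  # stand-in for target-domain data
target_stats = snapshot_bn_stats(model)

load_bn_stats(model, source_stats)     # switch back before evaluating on source data
model.eval()
```

The learned `weight` and `bias` are shared between both domains here; only the running statistics are swapped.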

@guyue_hu @Lan1991Xu Have you solved this problem?