Batch norm parameters not included in model.parameters()

Hello,

I found that batch normalization parameters such as running_mean and running_var are not included in model.parameters(). However, they do appear in model.state_dict(). I would like to know whether this is expected behavior or a mistake on my side.

Thank you.

Kind regards

Hi @Yozey

I think you're right: running_mean and running_var are included in model.state_dict() but not in model.parameters().
My understanding is that running_mean and running_var are just statistics computed from particular batches of data points; during the model update phase, i.e. when the calculated gradients are used to update the model, these statistics are not touched. model.parameters() only contains the parameters that are actually "trained" during the model training process.
Actually, you can reach a similar conclusion by looking at how the optimizer is typically constructed:

optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
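
To make this concrete, here is a minimal sketch (using a throwaway nn.BatchNorm1d module purely for illustration) showing where the running statistics actually live: they are registered as buffers, so they show up in state_dict() but not in parameters().

import torch.nn as nn

bn = nn.BatchNorm1d(4)

# The learnable affine parameters (weight/gamma and bias/beta) are what
# the optimizer updates via gradients.
print([name for name, _ in bn.named_parameters()])  # ['weight', 'bias']

# running_mean and running_var are buffers, not parameters, so they are
# saved in state_dict() but never handed to the optimizer.
print([name for name, _ in bn.named_buffers()])
print(list(bn.state_dict().keys()))
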

Hope this can help you a bit.

Hi @zazzyy,

Thank you very much for your kind response. I totally agree with you that running_mean and running_var are computed from batch statistics during training.
But at test time we use pop_mean and pop_var, which represent the statistics of the entire dataset. Please refer to training step 2 in the answer by Le Quang Vu here. pop_mean and pop_var also get updated during the training stage.
I suppose that pop_mean and pop_var there are equivalent to running_mean and running_var in PyTorch, so they need to be updated as well?

Thank you in advance for any help.

Hi @Yozey

Based on the pointer you provided, in TensorFlow pop_mean and pop_var are updated adaptively during the model training step (batch by batch) from the batch_mean and batch_var of the current batch with some decay (say 0.99). During the test step, pop_mean/var are then used directly for model evaluation.

Based on my understanding, PyTorch does exactly the same thing, which we can see from this comment (https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/batchnorm.py#L107-L110). If I'm correct, the momentum argument is used as the decay factor in PyTorch. If you delve a bit into the source code (https://github.com/pytorch/pytorch/blob/2502ac082b74bdcbe95826ecb56d1416b344ef0d/torch/csrc/autograd/functions/batch_normalization.cpp#L66-L93), you can see that the mean/var computed for a particular batch are first stored as saved_mean/var and then used in the forward function. So it's probably the case that, inside the forward pass, saved_mean/var are folded into running_mean/var with that decay.
Therefore, everything so far works the same as in TensorFlow. My point is that you don't need to update running_mean/var manually; PyTorch will do that for you.
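
Roughly, the update applied during each training-mode forward pass looks like the sketch below (plain Python for illustration, not the actual C++ implementation). Note that PyTorch's momentum is the weight of the new batch statistic (default 0.1), i.e. it plays the role of (1 - decay) in the TensorFlow formulation.

# Rough sketch of the exponential-moving-average update that batch norm
# applies to its running statistics during training.
def update_running_stats(running_mean, running_var, batch_mean, batch_var, momentum=0.1):
    # momentum weights the *new* batch statistic; (1 - momentum) keeps the old estimate.
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    # (PyTorch actually uses the unbiased batch variance here; this sketch glosses over that.)
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return running_mean, running_var
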

Hope this one helps
Thanks

Thank you @zazzyy very much for your kind and detailed answer. It’s really helpful.

Maybe I didn't explain it well in my first post. What I actually want to do is freeze several modules of a pre-trained model and pass the other modules (including a resnet block with batch norm layers) to the optimizer for training. However, when I checked the parameters passed to the optimizer, I found that running_mean and running_var are not included.

That's why I'm wondering whether running_mean and running_var can still be updated during training.

I suppose that PyTorch will still update running_mean and running_var even though they are not explicitly passed to the optimizer.
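
For anyone who wants to verify this, here is a minimal sketch (a standalone nn.BatchNorm2d with random inputs, just for illustration): the running statistics still move during forward passes in train() mode even when the module's parameters are not given to any optimizer, and switching the module to eval() freezes them.

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(8)
for p in bn.parameters():
    p.requires_grad = False       # "freeze" the learnable affine parameters

before = bn.running_mean.clone()

bn.train()                        # running stats are updated in train mode
bn(torch.randn(16, 8, 4, 4))      # one forward pass with random data
print(torch.allclose(before, bn.running_mean))  # False: stats were updated

bn.eval()                         # in eval mode the stored stats are used, not updated
frozen = bn.running_mean.clone()
bn(torch.randn(16, 8, 4, 4))
print(torch.allclose(frozen, bn.running_mean))  # True: stats unchanged
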