Train BatchNorm2D with Training and Evaluation Modes

Hi. I am wondering if it is possible to train a batchnorm2d layer using a very low learning rate.

Assume that we are training a CNN with a batchnorm2d layer, i.e. bn_layer.

To prevent the bn_layer from learning during the training stage, are the following two settings equivalent?

Setting 1 : set the bn_layer to evaluation mode

bn_layer.eval()

Setting 2 : set the bn_layer to training mode

bn_layer.train()
bn_layer.momentum = 0
bn_layer.weight.requires_grad = False
bn_layer.bias.requires_grad = False

Based on my understanding, both settings should be equivalent and therefore return similar results. Unfortunately, Setting 2 does not seem to freeze the bn_layer from learning at all.

However, I observe that all parameters and running statistics, including running_mean, running_var, weight, and bias, remain unchanged.

Any reason why the two settings are different?

Could you share a code snippet where only weight and bias are trained, while both running statistics are kept unchanged?

Thank you.

No, these settings are not equivalent, as the first one will not freeze the trainable parameters (weight and bias), but will only use the internal running_mean and running_var.
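For illustration, here is a minimal sketch (assuming a standalone nn.BatchNorm2d and a plain SGD optimizer, not your actual model) showing that .eval() alone does not stop weight and bias from being updated, while the running stats stay fixed:

import torch
import torch.nn as nn

# Minimal sketch: .eval() freezes the running stats, but NOT the affine parameters.
bn_layer = nn.BatchNorm2d(3)
bn_layer.eval()  # running_mean / running_var will no longer be updated

optimizer = torch.optim.SGD(bn_layer.parameters(), lr=0.1)

x = torch.randn(8, 3, 4, 4)
out = bn_layer(x)
out.mean().backward()
optimizer.step()

print(bn_layer.weight, bn_layer.bias)               # affine parameters were updated
print(bn_layer.running_mean, bn_layer.running_var)  # still zeros / ones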

I don’t understand this statement. If all parameters and the running stats remain the same, what do you consider to still be “learning”?


Thank you for your response.

What about the following two settings? Are they equivalent?

Setting 1 :

bn_layer.eval()
bn_layer.weight.requires_grad = False
bn_layer.bias.requires_grad = False

Setting 2 :

bn_layer.train()
bn_layer.momentum = 0
bn_layer.weight.requires_grad = False
bn_layer.bias.requires_grad = False

My goal is to explore the momentum setting such that the bn_layer only performs very minimal learning.

However, with Setting 2 the performance drops drastically, even when setting momentum = 0.00001.

Thank you.

No, these approaches still wouldn’t be equivalent, since the batchnorm layer in setting 1 would normalize the input data using the running statistics (zero mean, unit variance by default if they weren’t updated) while the layer in setting 2 would use the batch statistics, but would not update the internal stats with them.
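As a quick check, here is a minimal sketch (assuming standalone, freshly initialized layers rather than your model) comparing the two settings on the same input:

import torch
import torch.nn as nn

# Use input whose stats are far from zero mean / unit variance
torch.manual_seed(0)
x = torch.randn(8, 3, 4, 4) * 5 + 2

bn_eval = nn.BatchNorm2d(3)
bn_eval.eval()            # Setting 1: normalizes with the running stats (zeros / ones here)

bn_train = nn.BatchNorm2d(3)
bn_train.train()
bn_train.momentum = 0     # Setting 2: normalizes with the batch stats, running stats frozen

out_eval = bn_eval(x)
out_train = bn_train(x)

print(torch.allclose(out_eval, out_train))           # False: different normalization
print(bn_train.running_mean, bn_train.running_var)   # still zeros / ones (not updated)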


Assume that I am fine-tuning a pretrained model that was trained with running statistics.

Based on BatchNorm2d — PyTorch 2.1 documentation

the running statistics are updated as x_new = (1 - momentum) * x_hat + momentum * x_t. If momentum = 0, the new running statistics x_new should be equal to the pretrained statistics x_hat, and therefore we should be able to obtain the same results for both Setting 1 and 2?

The running stats will be equal, since the momentum was set to zero and they will thus not be updated. However, the layer is still in training mode and will thus use the batch stats to normalize the input, not the running stats.
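For example, a minimal sketch assuming some made-up “pretrained” running stats (the values 2.0 and 25.0 are hypothetical):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 4, 4) * 5 + 2

bn_layer = nn.BatchNorm2d(3)
# pretend these values were learned during pretraining
bn_layer.running_mean.fill_(2.0)
bn_layer.running_var.fill_(25.0)

bn_layer.eval()
out_eval = bn_layer(x)      # Setting 1: normalizes with the running stats

bn_layer.train()
bn_layer.momentum = 0
out_train = bn_layer(x)     # Setting 2: normalizes with the batch stats

print(bn_layer.running_mean)                 # still 2.0 everywhere (not updated)
print(torch.allclose(out_eval, out_train))   # False: different stats were used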


I think I got your point…!

In training mode, the batch statistics are always used for normalization, while the running statistics are updated from the batch statistics.

However, if the CNN is set to evaluation mode, only the running statistics are used, in both the training and testing stages.

Hmm, may I know if my understanding is correct?

Thank you.

Yes, your understanding is correct.
In the default setup:

  • training: normalize input with input_stats and update running_stats with input_stats using momentum formula
  • evaluation: use running_stats to normalize input

You can of course change this behavior by setting track_running_stats=False, which would also use the input_stats during evaluation.
The affine parameters are independent from this behavior.
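A minimal sketch of the track_running_stats=False behavior (on a standalone layer):

import torch
import torch.nn as nn

# With track_running_stats=False no running stats are kept and the batch stats
# are used for normalization even in eval mode.
bn_layer = nn.BatchNorm2d(3, track_running_stats=False)
print(bn_layer.running_mean, bn_layer.running_var)        # None None

x = torch.randn(8, 3, 4, 4) * 5 + 2

bn_layer.eval()
out = bn_layer(x)
print(out.mean(dim=(0, 2, 3)), out.std(dim=(0, 2, 3)))    # ~0 and ~1 per channel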


Thank you, and I truly appreciate your very clear explanation.

Would it be possible to use running_stats instead of input_stats for normalization during the training stage?

This is because I am dealing with training data from different domains, and the direct use of input_stats therefore harms the performance.

Thank you.

Yes, you could call .eval() on the batchnorm layers to use the running stats.
Note, however, that these stats won’t really normalize the input if they were not updated, since running_mean is initialized with zeros and running_var with ones.
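You can check this on a freshly initialized layer (a minimal sketch, not your actual model):

import torch
import torch.nn as nn

bn_layer = nn.BatchNorm2d(3)
print(bn_layer.running_mean)     # tensor([0., 0., 0.])
print(bn_layer.running_var)      # tensor([1., 1., 1.])

# In eval mode these defaults are used, so the input passes through almost
# unchanged (up to eps and the affine parameters).
bn_layer.eval()
x = torch.randn(8, 3, 4, 4) * 5 + 2
out = bn_layer(x)
print(out.mean(dim=(0, 2, 3)))   # still far from zero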


For the pretrained BN layers, would it be possible to force the model to use the running_stats for normalization during fine-tuning (in training mode) instead of the input_stats? Specifically:

training - normalize input with running_stats and update running_stats with input_stats using momentum formula

I am sorry for my confusing sentences.

Yes, you would have to call .eval() on these batchnorm layers and the running (pretrained) stats will be used.
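A sketch of this setup, assuming model is any CNN containing nn.BatchNorm2d layers (freeze_bn_stats is just a hypothetical helper name):

import torch.nn as nn

def freeze_bn_stats(model: nn.Module) -> None:
    # Switch only the batchnorm layers to eval mode so their pretrained running
    # stats are used for normalization and are no longer updated.
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            module.eval()

# usage before each fine-tuning epoch:
# model.train()           # puts all layers (incl. batchnorm) into training mode
# freeze_bn_stats(model)  # then switch the batchnorm layers back to eval mode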


Ah, I got it completely…!

Thank you very much for your patience. :))

Hi. I just completed some experiments showing that if .eval() is set during training, the running_stats will no longer be updated based on the input_stats.

Is my observation correct?

Yes, this is described here. During eval() only the running stats will be applied.
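A minimal check of this observation on a standalone layer:

import torch
import torch.nn as nn

bn_layer = nn.BatchNorm2d(3)
x = torch.randn(8, 3, 4, 4) * 5 + 2

bn_layer.train()
bn_layer(x)
print(bn_layer.running_mean)     # moved towards the batch mean

bn_layer.eval()
before = bn_layer.running_mean.clone()
bn_layer(x)
print(torch.equal(before, bn_layer.running_mean))   # True: no update in eval mode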