Based on my understanding, both settings should be equivalent and should therefore return similar results. Unfortunately, setting 2 doesn’t freeze the bn_layer from learning at all.

However, I observe that all parameters and running statistics, including running_mean, running_var, weight, and bias, remain unchanged.

Any reason why the two settings are different?

Is there a code snippet for training only weight and bias while keeping both running statistics fixed?

No, these settings are not equivalent, as the first one will not freeze the trainable parameters (weight and bias), but will only use the internal running_mean and running_var.

I don’t understand this statement. If all parameters and the running stats remain the same, what do you consider to be still “learning”?

No, these approaches still wouldn’t be equivalent, since the batchnorm layer in setting 1 would normalize the input data using the running statistics (zero mean, unit variance by default if they weren’t updated) while the layer in setting 2 would use the batch statistics, but would not update the internal stats with them.

If momentum = 0, the running statistics X_new should be equal to the pre-trained statistics X_hat, so shouldn’t we obtain the same results for both Setting 1 and Setting 2?

The running stats will be equal, since the momentum was set to zero and they will thus not be updated. However, the layer is still in training mode and will thus use the batch stats, not the running stats, to normalize the input.
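A minimal sketch of this point, assuming a small `nn.BatchNorm1d` layer whose "pre-trained" statistics are made-up values chosen only for illustration. With `momentum=0.0` the running stats stay put in training mode, yet the output still differs from the eval-mode output because batch stats are used for normalization:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(3, momentum=0.0)

# Hypothetical "pre-trained" statistics, filled in by hand for the demo.
with torch.no_grad():
    bn.running_mean.fill_(5.0)
    bn.running_var.fill_(4.0)

x = torch.randn(8, 3) * 2.0 + 5.0

# Training mode: normalizes with the batch stats; momentum=0 leaves
# the running stats untouched.
bn.train()
out_train = bn(x)
stats_unchanged = (
    torch.allclose(bn.running_mean, torch.full((3,), 5.0))
    and torch.allclose(bn.running_var, torch.full((3,), 4.0))
)

# Eval mode: normalizes with the (unchanged) running stats.
bn.eval()
out_eval = bn(x)

outputs_differ = not torch.allclose(out_train, out_eval)
print(stats_unchanged, outputs_differ)
```

So freezing the running stats via `momentum=0` and switching the layer to eval mode are two different things, even though the buffers hold the same values in both cases.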

In the training mode, we always use the batch statistics for normalization, but in the meantime the running statistics are updated w.r.t. the batch statistics.

However, if the CNN is set to evaluation mode, only the running statistics are used for normalization, in both the training and testing stages.

Yes, your understanding is correct.
In the default setup:

training: normalize input with input_stats and update running_stats with input_stats using momentum formula

evaluation: use running_stats to normalize input

You can of course change this behavior by setting track_running_stats=False, which would also use the input_stats during evaluation.
The affine parameters are independent from this behavior.
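The default behavior described above can be checked with a short sketch. It assumes a freshly constructed `nn.BatchNorm1d` (so `running_mean` starts at zeros, `running_var` at ones, `weight` at ones and `bias` at zeros) and a random input created just for the demo:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 4) * 3.0 + 2.0

bn = nn.BatchNorm1d(4)  # defaults: momentum=0.1, track_running_stats=True
momentum = bn.momentum

# Training: one forward pass updates the running stats with the momentum formula.
bn.train()
_ = bn(x)
expected_mean = (1 - momentum) * torch.zeros(4) + momentum * x.mean(dim=0)
# The running variance is updated with the unbiased batch variance.
expected_var = (1 - momentum) * torch.ones(4) + momentum * x.var(dim=0, unbiased=True)
mean_matches = torch.allclose(bn.running_mean, expected_mean, atol=1e-5)
var_matches = torch.allclose(bn.running_var, expected_var, atol=1e-5)

# Evaluation: the running stats are used to normalize the input.
bn.eval()
out_eval = bn(x)
manual = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)
eval_uses_running = torch.allclose(out_eval, manual, atol=1e-5)

# track_running_stats=False: batch stats are used even in eval mode.
bn_nt = nn.BatchNorm1d(4, track_running_stats=False)
bn_nt.eval()
out_nt = bn_nt(x)
nt_uses_batch = torch.allclose(out_nt.mean(dim=0), torch.zeros(4), atol=1e-5)

print(mean_matches, var_matches, eval_uses_running, nt_uses_batch)
```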

Yes, you could call .eval() on the batchnorm layers to use the running stats.
Note, however, that these stats won’t normalize the input if they were never updated, since running_mean is initialized with zeros and running_var with ones.
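One way this could look for fine-tuning, where only weight and bias of the batchnorm layers should keep training while the running stats stay fixed. The helper name `freeze_bn_stats` and the toy model are hypothetical and only serve the illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def freeze_bn_stats(model):
    """Put every batchnorm layer into eval mode (hypothetical helper)."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()

# Toy model standing in for a pre-trained network.
model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 2))

model.train()           # sets all submodules to training mode
freeze_bn_stats(model)  # then switch only the batchnorm layers back to eval

bn = model[1]
mean_before = bn.running_mean.clone()
var_before = bn.running_var.clone()
weight_before = bn.weight.detach().clone()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, target = torch.randn(16, 4), torch.randn(16, 2)

loss = nn.functional.mse_loss(model(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Running stats are untouched, while the affine parameters were trained.
stats_frozen = torch.equal(bn.running_mean, mean_before) and torch.equal(bn.running_var, var_before)
affine_updated = not torch.equal(bn.weight, weight_before)
print(stats_frozen, affine_updated)
```

Keep in mind that a later `model.train()` call puts the batchnorm layers back into training mode, so the helper would have to be called again after it.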

For the pre-learned BN layers, would it be possible to force the model to use the running_stats for normalization during fine-tuning (in the training mode), instead of input_stats? Specifically,

training - normalize input with running_stats and update running_stats with input_stats using momentum formula

Hi. I just ran some experiments confirming that if .eval() is set during training, the running_stats are no longer updated from the input_stats.
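Since `.eval()` stops the updates, getting both behaviors at once (normalize with running_stats, still update them from input_stats) would need a custom layer. Here is a rough sketch, not a built-in PyTorch option: the class name `RunningStatsBatchNorm1d` is hypothetical, it assumes 2D inputs of shape `(N, C)` and a float `momentum`, and updating the stats before normalizing is my own choice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RunningStatsBatchNorm1d(nn.BatchNorm1d):
    """Hypothetical variant: always normalizes with the running stats,
    but keeps updating them from the batch while in training mode."""

    def forward(self, x):
        if self.training:
            with torch.no_grad():
                # Momentum update, assuming x has shape (N, C).
                batch_mean = x.mean(dim=0)
                batch_var = x.var(dim=0, unbiased=True)
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * batch_mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * batch_var)
        # training=False makes F.batch_norm normalize with the running stats.
        return F.batch_norm(
            x, self.running_mean, self.running_var,
            self.weight, self.bias, training=False, eps=self.eps,
        )

torch.manual_seed(0)
bn = RunningStatsBatchNorm1d(3)
with torch.no_grad():
    bn.running_mean.fill_(1.0)  # made-up "pre-trained" values
    bn.running_var.fill_(2.0)

x = torch.randn(8, 3)
bn.train()
out = bn(x)

expected_mean = 0.9 * 1.0 + 0.1 * x.mean(dim=0)
expected_var = 0.9 * 2.0 + 0.1 * x.var(dim=0, unbiased=True)

out.sum().backward()  # weight and bias still receive gradients
print(torch.allclose(bn.running_mean, expected_mean, atol=1e-6))
```

The stats are updated before the normalization call so that the buffers are not modified in place between the forward and backward pass of the same batch.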