Affine and momentum of BatchNorm layer

Hi,
In a specific application, I need to freeze the running statistics calculation of a BatchNorm layer in part of my code, but I still need to use the “gamma” (weight) and “beta” (bias) of this layer during training, with gradients flowing in the forward/backward pass.
I have implemented this by building an extra BatchNorm layer with affine=False and doing the forward pass as:

import torch.nn as nn

base_BatchNorm = nn.BatchNorm1d(1200)
extra_BatchNorm = nn.BatchNorm1d(1200, affine=False)
# normalize without affine, then apply the trainable weight and bias of the base layer
x = extra_BatchNorm(x) * base_BatchNorm.weight + base_BatchNorm.bias

I am curious if there is a cleaner way to implement this. Could it be done by setting momentum=0, or by a combination of momentum=0 and affine=False on base_BatchNorm, as in:

base_BatchNorm.momentum = 0
base_BatchNorm.affine = False
x = base_BatchNorm(x)
base_BatchNorm.momentum = 0.1
base_BatchNorm.affine = True

Is it right?

You can call bn.eval() which will use the running stats and will not update them. The affine parameters will still be trained.
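A quick sketch to show both effects (the feature size is just a placeholder):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(1200)
bn.eval()  # running_mean/running_var are used for normalization and no longer updated

x = torch.randn(8, 1200)
out = bn(x)
out.mean().backward()

print(bn.running_mean.abs().sum())  # still all zeros, i.e. unchanged
print(bn.weight.grad is None, bn.bias.grad is None)  # False, False: gamma and beta get gradients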

I don’t want to use the running stats in this part of my code. I want to apply a zero-mean, unit-variance normalization of the current batch, updating the affine parameters but not the running stats. I think calling bn.eval() will use the running stats in the forward pass of the BatchNorm layer. Is that right?

In that case you could use F.linear with the bn.weight and bn.bias during the forward pass.

You mean something like:
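(just a rough sketch of how I understand it: normalize with the current batch statistics only, then apply the layer’s own weight and bias per channel)

import torch
import torch.nn as nn
import torch.nn.functional as F

base_BatchNorm = nn.BatchNorm1d(1200)
x = torch.randn(8, 1200)

# normalize with the batch statistics; running stats are neither read nor updated
x = F.batch_norm(x, None, None, training=True, eps=base_BatchNorm.eps)
# apply the trainable affine parameters manually
x = x * base_BatchNorm.weight + base_BatchNorm.bias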

My issue is that I want to apply this operation to other deep nets (e.g. ResNets) whose forward pass methods I cannot change directly, as there are multiple uses of BatchNorm. Besides, I would have to do this for every net, which is not practical. Can I apply this call:

for batchnorm layers without changing their code internally? Something like model.apply?

No, model.apply will recursively apply the passed method to all layers.

I think the cleanest way would be to write a custom module using your suggested behavior and replace all batchnorm layers in the model with your custom ones.
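Something along these lines could be a starting point (a rough sketch; FrozenStatsBatchNorm and replace_batchnorm are made-up names, and you would need to verify it against your exact requirements):

import torch.nn as nn
import torch.nn.functional as F

class FrozenStatsBatchNorm(nn.Module):
    # wraps an existing BatchNorm layer: normalizes with the current batch
    # statistics, never reads or updates the running stats, but still uses
    # (and trains) the original weight and bias
    def __init__(self, bn):
        super().__init__()
        self.bn = bn

    def forward(self, x):
        return F.batch_norm(
            x, None, None,
            weight=self.bn.weight, bias=self.bn.bias,
            training=True, eps=self.bn.eps)

def replace_batchnorm(module):
    # recursively swap every batchnorm layer in-place for the wrapper above,
    # e.g. replace_batchnorm(resnet_model)
    for name, child in module.named_children():
        if isinstance(child, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            setattr(module, name, FrozenStatsBatchNorm(child))
        else:
            replace_batchnorm(child)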


Sounds good. Thank you.

I’m trying to freeze the bn training at a certain point, but I was surprised to find that the affine parameters are still updated after calling bn.eval(). Is this by design?

Yes, as explained here:

To freeze the affine parameters (i.e. the weight and bias) you would need to set .requires_grad = False on them.
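For example (a small sketch on a placeholder model, freezing both the running stats and the affine parameters of every batchnorm layer):

import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 20), nn.BatchNorm1d(20), nn.ReLU())  # placeholder model

for module in model.modules():
    if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        module.eval()  # use the running stats and stop updating them
        if module.affine:
            module.weight.requires_grad = False  # freeze gamma
            module.bias.requires_grad = False    # freeze beta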

Thanks for your quick response. I certainly don’t see that in the documentation (BatchNorm2d — PyTorch 1.13 documentation). So even during my test() loop, where I’ve set model.eval(), the affine parameters are still being updated?!

Yes, as already mentioned.
You are right that the docs don’t seem to explain it in detail. Would you be interested in improving these?

Sorry to nag, but why would any parameters be updated during inference? And why isn’t this documented?

Because calling eval() on any module is irrelevant to Autograd and only changes the behavior of some layers. This is by design, to e.g. allow dropout layers to still be active during inference, which might be useful for some use cases. The affine, trainable parameters use the .requires_grad attribute to define whether gradients should be computed for them or not (i.e. whether they should be frozen or not).

Improvements are more than welcome. Let me know if you are interested in improving the documentation.

Ah, I see some light at the end of the tunnel!! Yes, I’d be happy to help with the documentation.

Can you help me understand a bit more? I guess I thought the actual parameter updating happened in response to backward and the optimizer.

I didn’t realize that dropout could still be active during inference. I’m surprised there isn’t something like ireallyreallywanttoeval(true) that would turn off everything. I assume that by the time I export to an ONNX model, dropout and bn are fully turned off?

I think you misunderstood me or I wasn’t clear enough.
The train() and eval() calls on a model or module will set the internal self.training flag to True or False, respectively.
This attribute changes the layer’s behavior: e.g. calling dropout.eval() will disable it, while calling batchnorm.eval() will make it use the running stats to normalize the input activation and stop updating them.

These calls, however, do not change the Autograd behavior and do not freeze trainable nn.Parameters. These parameters, weight and bias in batchnorm layers, are frozen by changing their requires_grad attribute.

I tried to explain why the Autograd behavior is decoupled from the train/eval behavior, since you might want to use some layers in training behavior (e.g. enabling dropout) and others in eval mode (e.g. batchnorm layers), while freezing some parameters and training others.
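As a toy example of that mix (just a sketch on a placeholder model):

import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10), nn.Dropout(p=0.5))

model.train()                          # dropout active, batchnorm uses and updates batch stats
model[1].eval()                        # this batchnorm now uses its running stats and stops updating them
model[1].weight.requires_grad = False  # additionally freeze gamma
model[1].bias.requires_grad = False    # additionally freeze beta
# dropout stays active and the linear layer stays fully trainable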

Thanks so much for the clarification. That’s in line with my understanding.

The comment I think may not be correct is this: “This is by design to e.g. allow dropout layers to still be active during inference, which might be useful for some use cases.” Per the documentation (Dropout — PyTorch 1.13 documentation), dropout only occurs during training; there is nothing to suggest it occurs during inference. Sorry to be pedantic, I just want to make sure I understand.

Maybe what would help the documentation is an explicit callout for what train() does.

This is what I tried to clarify:

I don’t claim dropout is always on, as it depends on the self.training status. Just that Autograd is completely independent from train/eval mode and you can mix different behavior together depending on your use case.
If calling model.eval() also forcibly disabled Autograd and forced eval behavior on all submodules, it would limit specific use cases, such as explicitly allowing dropout to be used during testing.
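E.g. something like this (a sketch on a placeholder model, keeping only the dropout layers active at test time):

import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Dropout(p=0.5))

model.eval()  # everything in eval mode: dropout disabled, batchnorm layers would use running stats
for module in model.modules():
    if isinstance(module, nn.Dropout):
        module.train()  # re-enable just the dropout layers for test-time sampling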

Got it! Thanks so much for the clarification.
