Using bias=False before batch norm

#model1
import torch
import torch.nn as nn
from efficientnet_pytorch import EfficientNet

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def getmodel():
    model = EfficientNet.from_pretrained('efficientnet-b5')
    # Replace the classifier head: Linear -> ReLU -> BatchNorm -> Dropout -> Linear
    model._fc = nn.Sequential(
        nn.Linear(in_features=2048, out_features=1024, bias=True),  # first layer
        nn.ReLU(),
        nn.BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
        nn.Dropout(p=0.5),
        nn.Linear(in_features=1024, out_features=4, bias=True))  # last layer
    model = model.to(device)
    return model

I have a few questions about the above code block that I want to clarify.

  1. I have read that bias should be True (bias=True) in the last linear layer, and my model also performed well when it was turned on.
  2. Most people suggest that bias should be turned off (bias=False) in a layer that feeds into batch norm (even the biases in the Conv layers of EfficientNet are turned off before batch norm). But my model performed badly when I turned off the bias in the first layer. What should I follow?
  3. This model (model2) just uses a single linear layer:
#model2
def getmodel():
    model = EfficientNet.from_pretrained('efficientnet-b5')
    # Replace the classifier head with a single linear layer
    model._fc = nn.Sequential(
        nn.Linear(in_features=2048, out_features=4, bias=True))  # last layer
    model = model.to(device)
    return model

Which approach should I follow, model1 or model2?

Hi,

  1. The thing is that in your case, you have a ReLU between the Linear and the BatchNorm, so that statement may not be true for your model.
    I think that statement comes from the fact that the batch norm will center the values, so a bias in the previous layer is useless: it will just be cancelled by the batch norm (see the sketch after this list).
  2. model2 has far fewer parameters, so it will perform differently. Which one is better will depend on your dataset, though.
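
To make the cancellation concrete, here is a minimal standalone sketch (the batch size and layer sizes are made up for illustration, not from the models above). In training mode, BatchNorm1d subtracts the per-feature batch mean, so the output is identical whether or not the preceding Linear has a bias:

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 8)  # dummy batch: 32 samples, 8 features

linear = nn.Linear(8, 16, bias=True)
bn = nn.BatchNorm1d(16)
bn.train()  # use batch statistics, as during training

out_with_bias = bn(linear(x))

# Zero the bias and recompute: the batch norm output is unchanged,
# because subtracting the batch mean removes any constant shift.
with torch.no_grad():
    linear.bias.zero_()
out_without_bias = bn(linear(x))

print(torch.allclose(out_with_bias, out_without_bias, atol=1e-6))  # True

# Note: with a ReLU between the Linear and the BatchNorm (as in model1),
# this cancellation no longer holds, since ReLU is nonlinear.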

So just to clarify: if there is an activation layer (ReLU) in between, then there is no need to turn off the bias in the first layer. Right?

I also have this doubt: does a single linear layer at the end (2048 → 4 output classes) learn less than (2048 → 1024 → 4 output classes)? If yes, in what situations is this helpful? Why do most people choose a model2-like architecture during fine-tuning?

This model2 is not as powerful, so it might be worse on the training set.
But because it has fewer parameters (and is linear), the function it is going to learn is much simpler and so is naturally regularized (due to the structure of the function). This could be an advantage depending on your task and could lead to better validation accuracy in some cases.
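
For context, a model2-style head is also what you get in the common "linear probe" fine-tuning setup, where the pretrained backbone is frozen and only the new linear layer is trained. A hedged sketch of that setup (the choice of Adam and the learning rate are illustrative assumptions, not from this thread):

import torch
import torch.nn as nn
from efficientnet_pytorch import EfficientNet

model = EfficientNet.from_pretrained('efficientnet-b5')
for p in model.parameters():
    p.requires_grad = False      # freeze the pretrained backbone
model._fc = nn.Linear(2048, 4)   # new single-linear head, trainable by default

# Only the head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)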
