Why do we skip initializing running_mean and running_var when using a pretrained resnet50?

Hi everyone, I am new to PyTorch and there’s one issue that really confuses me.

When I try to use transfer learning, I take resnet50 as the base from this link:

vision/torchvision/models/resnet.py

and download the weights from resnet50-19c8e357.pth.

Here’s the problem:

Before initializing the weights from the pretrained model, I first check whether all of the model’s layer names match the pretrained model’s keys, so here’s what I do:

import torch

model_resnet = resnet50()  # built exactly the same as in the link above
resnet50_weights = torch.load("resnet50-19c8e357.pth")

# collect all layer names from model_resnet
names = {}
for name, param in model_resnet.named_parameters():
    names[name] = 0

# check whether anything is missing
for key in resnet50_weights:
    if key not in names:
        print(key)

And the output looks like this:

bn1.running_mean
bn1.running_var
layer1.0.bn1.running_mean
layer1.0.bn1.running_var
layer1.0.bn2.running_mean
layer1.0.bn2.running_var
layer1.0.bn3.running_mean
layer1.0.bn3.running_var
layer1.0.downsample.1.running_mean
layer1.0.downsample.1.running_var
layer1.1.bn1.running_mean
layer1.1.bn1.running_var
layer1.1.bn2.running_mean
layer1.1.bn2.running_var
layer1.1.bn3.running_mean
layer1.1.bn3.running_var
layer1.2.bn1.running_mean
layer1.2.bn1.running_var
layer1.2.bn2.running_mean
layer1.2.bn2.running_var
layer1.2.bn3.running_mean
layer1.2.bn3.running_var
layer2.0.bn1.running_mean
layer2.0.bn1.running_var
layer2.0.bn2.running_mean
layer2.0.bn2.running_var
layer2.0.bn3.running_mean
layer2.0.bn3.running_var
layer2.0.downsample.1.running_mean
layer2.0.downsample.1.running_var
layer2.1.bn1.running_mean
layer2.1.bn1.running_var
layer2.1.bn2.running_mean
layer2.1.bn2.running_var
layer2.1.bn3.running_mean
layer2.1.bn3.running_var
layer2.2.bn1.running_mean
layer2.2.bn1.running_var
layer2.2.bn2.running_mean
layer2.2.bn2.running_var
layer2.2.bn3.running_mean
layer2.2.bn3.running_var
layer2.3.bn1.running_mean
layer2.3.bn1.running_var
layer2.3.bn2.running_mean
layer2.3.bn2.running_var
layer2.3.bn3.running_mean
layer2.3.bn3.running_var
layer3.0.bn1.running_mean
layer3.0.bn1.running_var
layer3.0.bn2.running_mean
layer3.0.bn2.running_var
layer3.0.bn3.running_mean
layer3.0.bn3.running_var
layer3.0.downsample.1.running_mean
layer3.0.downsample.1.running_var
layer3.1.bn1.running_mean
layer3.1.bn1.running_var
layer3.1.bn2.running_mean
layer3.1.bn2.running_var
layer3.1.bn3.running_mean
layer3.1.bn3.running_var
layer3.2.bn1.running_mean
layer3.2.bn1.running_var
layer3.2.bn2.running_mean
layer3.2.bn2.running_var
layer3.2.bn3.running_mean
layer3.2.bn3.running_var
layer3.3.bn1.running_mean
layer3.3.bn1.running_var
layer3.3.bn2.running_mean
layer3.3.bn2.running_var
layer3.3.bn3.running_mean
layer3.3.bn3.running_var
layer3.4.bn1.running_mean
layer3.4.bn1.running_var
layer3.4.bn2.running_mean
layer3.4.bn2.running_var
layer3.4.bn3.running_mean
layer3.4.bn3.running_var
layer3.5.bn1.running_mean
layer3.5.bn1.running_var
layer3.5.bn2.running_mean
layer3.5.bn2.running_var
layer3.5.bn3.running_mean
layer3.5.bn3.running_var
layer4.0.bn1.running_mean
layer4.0.bn1.running_var
layer4.0.bn2.running_mean
layer4.0.bn2.running_var
layer4.0.bn3.running_mean
layer4.0.bn3.running_var
layer4.0.downsample.1.running_mean
layer4.0.downsample.1.running_var
layer4.1.bn1.running_mean
layer4.1.bn1.running_var
layer4.1.bn2.running_mean
layer4.1.bn2.running_var
layer4.1.bn3.running_mean
layer4.1.bn3.running_var
layer4.2.bn1.running_mean
layer4.2.bn1.running_var
layer4.2.bn2.running_mean
layer4.2.bn2.running_var
layer4.2.bn3.running_mean
layer4.2.bn3.running_var

As far as I know, when we freeze a pretrained model, one issue is that we need to be careful about whether the batchnorm layers use the current batch statistics or the stored running mean & var during transfer learning. Since we “freeze” the model, I think we should use the running mean & var, as the link below also mentions:
The Batch Normalization layer of Keras is broken

Does anyone have an idea how we get running_mean & running_var?

You are only saving the parameters in these lines of code:

names = {}
for name,param in model_resnet.named_parameters():
    names[name] = 0

while the running estimates are stored as buffers.
You could append these buffers using:

for name, buf in model_resnet.named_buffers():
    names[name] = 0

and run your code again.
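Putting the two loops together, a minimal version of the check (reusing your model_resnet and resnet50_weights from above) could look like this:

# Collect both parameter and buffer names, then compare against the checkpoint keys
names = set()
for name, _ in model_resnet.named_parameters():
    names.add(name)
for name, _ in model_resnet.named_buffers():
    names.add(name)

missing = [key for key in resnet50_weights if key not in names]
print(missing)  # should now be empty for this checkpoint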

Wow, thanks, it works!
However, I then initialize the model using

#fc layer has been deleted here
for key in resnet50_weights:
    model_resnet.state_dict()[key] = resnet50_weights[key]

And I do a little test to check whether the initialization was done properly.

print(model_resnet.state_dict()["layer4.2.bn3.running_mean"]==resnet50_weights["layer4.2.bn3.running_mean"])

And the output looks like this:

tensor([False, False, False,  ..., False, False, False])

which implies the initialization is wrong. But why does it fail?

I would recommend loading the state_dict, which will automatically map the right parameters and buffers, via:

model.load_state_dict(pretrained_model.state_dict())

or do you really need to manually load each parameter and buffer?

I use this to load the pretrained weights:

resnet50_weights = torch.load("resnet50-19c8e357.pth")

so

model_resnet.load_state_dict(resnet50_weights.state_dict())

would raise an error:

'collections.OrderedDict' object has no attribute 'state_dict'

As for “Do you really need to manually load each parameter and buffer?”:
Actually no. I am new to PyTorch, so if there’s a better way to do what I want, I would always prefer it! But for curiosity’s sake, I still wonder why it fails…

Based on the error message, it seems resnet50_weights might already be the state_dict. Could you try to load it via model_resnet.load_state_dict(resnet50_weights)?
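As a side note on why the manual assignment fails: state_dict() builds a fresh mapping every time it is called, so model_resnet.state_dict()[key] = ... only rebinds a key in that temporary dictionary and never writes into the model’s own tensors. If you really wanted a manual copy, a minimal sketch (copying in place with copy_) would be:

import torch

# Copy each checkpoint tensor into the model's existing tensors in place
with torch.no_grad():
    own_state = model_resnet.state_dict()
    for key, value in resnet50_weights.items():
        if key in own_state:
            own_state[key].copy_(value)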

Thanks! It works, but what if I want to initialize the model except for the fc layer?
I think in this case we have to assign each parameter and buffer, or is there an alternative?

I would just load the complete model and reinitialize the last layer afterwards. It seems simpler than the other way around. :wink:
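For example (the num_classes value here is just a placeholder for your own task):

import torch
import torch.nn as nn

# Load the complete pretrained state_dict, then swap in a fresh final layer
model_resnet = resnet50()
model_resnet.load_state_dict(torch.load("resnet50-19c8e357.pth"))

num_classes = 10  # placeholder: the number of classes of your target task
model_resnet.fc = nn.Linear(model_resnet.fc.in_features, num_classes)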

What if I want to discard the fc layer? How can I delete the fc layer after using

model_resnet.load_state_dict(resnet50_weights)

I think this might work

model_resnet = nn.Sequential(*list(model_resnet.children())[:-2])

Your approach might work for this model.
However, if your model’s forward method is a bit more complicated than a simple sequence of modules, you could also set model_resnet.fc = nn.Identity().

Hi, just making sure everything is right.

In the forward function of resnet50 I deleted the flatten():

def forward(self, x):
    x = self.conv1(x)
    x = self.bn1(x)
    x = self.relu(x)
    x = self.maxpool(x)
    x = self.layer1(x)
    x = self.layer2(x)
    x = self.layer3(x)
    x = self.layer4(x)
    x = self.avgpool(x)
    # x = torch.flatten(x)
    x = self.fc(x)
    return x

And after constructing the model, I do what you said:

model_resnet = resnet50()
resnet50_weights = torch.load("resnet50-19c8e357.pth")
model_resnet.load_state_dict(resnet50_weights)
model_resnet.fc = nn.Identity()
model_resnet.avgpool = nn.Identity()

And without flatten, I think the output shape would now be the same as the last residual block’s output shape. If you find anything wrong above, please do correct me. Thanks!

The Identity assignment would be a hack in case you don’t want to manipulate the forward.
Since you are changing it anyway, just comment out the calls to self.avgpool and self.fc and check the shape of the output after passing some random input to the model.
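Something along these lines (the input size is just an assumption of standard 224x224 RGB images):

import torch

# Quick shape check with a random input
model_resnet.eval()
with torch.no_grad():
    out = model_resnet(torch.randn(1, 3, 224, 224))
print(out.shape)  # expected: torch.Size([1, 2048, 7, 7]) once avgpool and fc are skipped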

OK, I’ve got the desired shape. Thanks!!! :slight_smile::slight_smile:
BTW, I was browsing around and found this thread:
ByteTensor to FloatTensor is slow? - PyTorch Forums
Why do people always convert a tensor’s type to float while calculating? Would that be faster?

The default dtype is float32, which works faster in CUDA than float64.
Usually you don’t need the precision and range of FP64, and can stick to FP32.
However, Tensor Cores in newer NVIDIA GPUs can accelerate FP16 operations.
Since the value range and precision of this number format might be too narrow for some models, we developed NVIDIA/apex, which contains mixed-precision recipes, and are currently working on upstreaming it into PyTorch in this issue.
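As a small illustration of the usual conversion (the uint8 tensor here is just a made-up stand-in for image data stored as bytes):

import torch

img = torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8)  # bytes, e.g. from an image file
x = img.float() / 255.0           # convert to float32 before feeding a model
print(x.dtype)                    # torch.float32
print(torch.get_default_dtype())  # torch.float32, PyTorch's default floating point dtype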

Yeah, I would definitely follow it. But what about int? Are there reasons not to use int? Thanks!

I’m not familiar with training methods using int values, since the gradients would also be int, wouldn’t they?
Based on this assumption, it doesn’t sound really useful, but I haven’t looked at the latest research papers on this topic.
Inference, however, is possible using int values and might give you a performance gain on specialized hardware. Have a look at Quantization for more information.
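As a toy illustration of int8 inference via post-training dynamic quantization (the tiny model here is just an example, not your ResNet):

import torch
import torch.nn as nn

float_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
quantized_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8  # quantize the Linear weights to int8
)
out = quantized_model(torch.randn(2, 16))
print(out.dtype)  # the outputs stay float32; the weights are stored as int8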

Yeah, I would definitely check it out. And just a little more to ask: I would like to freeze the pretrained model, and the link below has already shown how you can completely freeze the bn layers while training:
freeze bn

But what confuses me is these lines of code:

if freeze_bn_affine:
    '''Freezing Weight/Bias of BatchNorm2D'''
    m.weight.requires_grad = False
    m.bias.requires_grad = False

So since I have already set the bn layers to eval() mode, why do I still need this?
And after this, do I need to set affine=False manually?

train and eval will change the behavior of the running estimates (using batch stats and updating running stats in train, while using running estimates in eval).
This is unrelated to the trainable affine parameters (weight and bias), so you should freeze these parameters, if you don’t want to train them any further.
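A minimal sketch of freezing the batchnorm layers completely (running stats via eval(), affine parameters via requires_grad) could look like this:

import torch.nn as nn

for m in model_resnet.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.eval()                        # use the stored running estimates, don't update them
        m.weight.requires_grad = False  # freeze the affine scale
        m.bias.requires_grad = False    # freeze the affine shift

One caveat: calling model_resnet.train() later will switch the batchnorm layers back to training mode, so you would need to re-apply the eval() calls after every such call.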

So after setting the weight’s and bias’s requires_grad=False, I do not have to do BatchNorm2d(3, affine=False) every single time, right?

Your line of code would re-initialize the batchnorm layer, which seems to be wrong, as you’ll lose all trained parameters and estimated stats.
So no, don’t use this line of code, just freeze the parameters using their requires_grad flag.