Finetuning intermediate layers of resnet18

Hi,
I changed the ResNet-18 model and I want to load weights from a pretrained model only for layer3 and layer4 of ResNet-18. Is the following piece of code the right way to do it?

    # load the pretrained weights into the model that will be modified
    base_model = 'pretrainedmodels/resnet18-5c106cde.pth'
    model = net.ResNet(net.BasicBlock, [2, 2, 2, 2], num_classes=1000).cuda() if params.cuda else net.ResNet(net.BasicBlock, [2, 2, 2, 2], num_classes=1000)
    model.load_state_dict(torch.load(base_model))

    # second pretrained copy, used only as a weight source
    other_model = net.ResNet(net.BasicBlock, [2, 2, 2, 2], num_classes=1000).cuda() if params.cuda else net.ResNet(net.BasicBlock, [2, 2, 2, 2], num_classes=1000)
    other_model.load_state_dict(torch.load(base_model))

    # replace layer3/layer4 with the custom modules, then copy the pretrained block weights
    model.layer3 = net.Inception()
    model.layer3.paral_0[0].load_state_dict(other_model.layer3[0].state_dict())
    model.layer3.paral_0[1].load_state_dict(other_model.layer3[1].state_dict())

    model.layer4 = net.Inception1()
    model.layer4.paral_0[0].load_state_dict(other_model.layer4[0].state_dict())
    model.layer4.paral_0[1].load_state_dict(other_model.layer4[1].state_dict())
    ...

Side note: I defined Inception() and Inception1() exactly the same as layer3 and layer4 of ResNet-18, respectively, so there is no dimension mismatch between them (as shown in the following):

    Inception(
      (paral_0): Sequential(
        (0): BasicBlock0(
          (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace)
          (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (downsample): Sequential(
            (0): Conv2d(128, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
            (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (1): BasicBlock1(
          (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace)
          (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )

    Inception1(
      (paral_0): Sequential(
        (0): BasicBlock2(
          (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace)
          (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (downsample): Sequential(
            (0): Conv2d(128, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
            (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (1): BasicBlock3(
          (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu): ReLU(inplace)
          (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )

The code looks alright.
To make sure everything was loaded properly, you could also print (some) values from model.layer3/4 and compare them against the corresponding other_model layers.

To save some memory, you could also store the already loaded parameters from model and reload them later instead of creating other_model, but your code should work either way.
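A minimal sketch of both suggestions, assuming the variable and module names from the snippet above (and that paral_0[0] has the same structure as the original block):

    import copy
    import torch

    # memory-saving variant: snapshot the pretrained blocks *before* replacing
    # layer3, so a second model instance is not needed at all
    block0_state = copy.deepcopy(model.layer3[0].state_dict())
    block1_state = copy.deepcopy(model.layer3[1].state_dict())

    model.layer3 = net.Inception()
    model.layer3.paral_0[0].load_state_dict(block0_state)
    model.layer3.paral_0[1].load_state_dict(block1_state)

    # sanity check: a copied parameter should match its snapshot exactly
    print(torch.equal(model.layer3.paral_0[0].conv1.weight,
                      block0_state['conv1.weight']))  # expected: True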

Actually, this was part of my issue. Here is the story: I'm trying to cluster the output of a trained ResNet-18 into, say, 10 clusters. Then, based on the clusters, I want to change the structure of the ResNet so that layer1 and layer2 are shared between clusters while layer3 and layer4 are specialized to each cluster. In other words, the new network looks like this:

[figure: clustering architecture diagram]

For a start, each one of cluster1 … clusterN has exactly the same structure as layer3 followed by layer4 of the ResNet. I load the pretrained ResNet model and freeze everything except the FC layer of each cluster. At the end, I put each FC layer's output at the correct positions (the clustering algorithm assigns different class indices to each cluster, so I need to rearrange the final result) and feed the 1000 classes into a softmax. But I don't know why my accuracy is around 0.01 after 400 epochs. Here is the piece of code:

    class Identity(nn.Module):
        def __init__(self):
            super(Identity, self).__init__()

        def forward(self, x):
            return x

    class Inception(nn.Module):
        def __init__(self, in_channels=2048):
            super(Inception, self).__init__()
            self.paral_0 = nn.Sequential(
                BasicBlock0(128, 128),
                BasicBlock1(128, 128),
            )

            self.paral_1 = nn.Sequential(
                BasicBlock0(128, 128),
                BasicBlock1(128, 128),
            )

            # ... paral_2 through paral_7 are defined the same way ...

            self.paral_8 = nn.Sequential(
                BasicBlock0(128, 128),
                BasicBlock1(128, 128),
            )

            self.paral_9 = nn.Sequential(
                BasicBlock0(128, 128),
                BasicBlock1(128, 128),
            )

        def forward(self, x):
            # every branch receives the same shared feature map
            y0 = self.paral_0(x)
            y1 = self.paral_1(x)
            # ... y2 through y7 are computed the same way ...
            y8 = self.paral_8(x)
            y9 = self.paral_9(x)

            return {0: y0, 1: y1, 2: y2, 3: y3, 4: y4, 5: y5, 6: y6, 7: y7, 8: y8, 9: y9}

    class Inception1(nn.Module):
        def __init__(self, in_channels=2048):
            super(Inception1, self).__init__()
            self.paral_0 = nn.Sequential(
                BasicBlock2(128, 128),
                BasicBlock3(128, 128),
                nn.AdaptiveAvgPool2d(output_size=(1, 1))
            )
            self.fc0 = nn.Linear(in_features=128, out_features=83, bias=True)

            self.paral_1 = nn.Sequential(
                BasicBlock2(128, 128),
                BasicBlock3(128, 128),
                nn.AdaptiveAvgPool2d(output_size=(1, 1))
            )
            self.fc1 = nn.Linear(in_features=128, out_features=143, bias=True)

            self.paral_2 = nn.Sequential(
                BasicBlock2(128, 128),
                BasicBlock3(128, 128),
                nn.AdaptiveAvgPool2d(output_size=(1, 1))
            )
            self.fc2 = nn.Linear(in_features=128, out_features=41, bias=True)

            # ... paral_3/fc3 through paral_8/fc8 are defined the same way ...

            self.paral_9 = nn.Sequential(
                BasicBlock2(128, 128),
                BasicBlock3(128, 128),
                nn.AdaptiveAvgPool2d(output_size=(1, 1))
            )
            self.fc9 = nn.Linear(in_features=128, out_features=85, bias=True)

        def forward(self, x):
            # x is the dict returned by Inception: one feature map per cluster
            y0 = self.paral_0(x[0])
            y0 = y0.view(y0.size(0), -1)
            y0 = self.fc0(y0)

            y1 = self.paral_1(x[1])
            y1 = y1.view(y1.size(0), -1)
            y1 = self.fc1(y1)

            # ... y2 through y7 are computed the same way ...

            y8 = self.paral_8(x[8])
            y8 = y8.view(y8.size(0), -1)
            y8 = self.fc8(y8)

            y9 = self.paral_9(x[9])
            y9 = y9.view(y9.size(0), -1)
            y9 = self.fc9(y9)

            # global class indices belonging to each cluster
            c0 = [339, 345, 407, 408, 412, 413, 427, 428, 436, 444, 450, 466, 468,
                  471, 475, 479, 489, 491, 498, 511, 519, 535, 547, 555, 561, 565,
                  569, 571, 573, 575, 581, 583, 586, 595, 603, 609, 612, 621, 627,
                  634, 637, 654, 656, 660, 661, 665, 670, 671, 675, 690, 703, 704,
                  705, 717, 730, 734, 751, 756, 757, 758, 779, 781, 791, 792, 795,
                  802, 803, 817, 820, 829, 830, 847, 856, 864, 866, 867, 870, 874,
                  877, 880, 919, 920, 981]

            c1 = [398, 409, 418, 419, 422, 426, 446, 447, 451, 453, 455, 458, 464,
                  473, 477, 478, 480, 481, 482, 485, 487, 492, 499, 506, 507, 508,
                  512, 518, 526, 527, 528, 530, 531, 534, 542, 543, 545, 548, 549,
                  550, 551, 553, 563, 584, 585, 587, 589, 590, 592, 593, 598, 600,
                  605, 606, 613, 620, 622, 623, 626, 629, 631, 632, 633, 635, 640,
                  644, 648, 650, 651, 662, 664, 673, 674, 680, 681, 685, 686, 688,
                  692, 695, 696, 700, 707, 709, 710, 711, 713, 714, 719, 726, 729,
                  732, 740, 742, 745, 749, 753, 754, 759, 760, 761, 763, 767, 769,
                  771, 772, 778, 782, 783, 784, 786, 798, 800, 804, 810, 811, 818,
                  823, 826, 827, 844, 845, 848, 851, 855, 859, 861, 872, 878, 881,
                  882, 886, 891, 892, 893, 896, 897, 898, 902, 916, 918, 921, 922]

            # ... c2 through c8 are defined the same way ...

            c9 = [403, 404, 405, 417, 421, 425, 437, 442, 448, 449, 460, 472, 483,
                  484, 497, 500, 510, 517, 525, 536, 538, 540, 554, 557, 562, 576,
                  580, 625, 628, 645, 646, 649, 657, 663, 668, 672, 682, 693, 694,
                  698, 701, 706, 708, 716, 718, 724, 727, 733, 744, 755, 780, 807,
                  812, 814, 821, 825, 832, 833, 835, 839, 853, 858, 863, 871, 873,
                  884, 888, 895, 900, 908, 912, 913, 914, 915, 958, 970, 972, 974,
                  975, 976, 977, 978, 979, 980, 984]

            # scatter each cluster's logits into its global class positions;
            # the buffer is created on the same device as the inputs
            final = torch.zeros(x[0].size(0), 1000, device=x[0].device)

            for i, j in enumerate(c0):
                final[:, j] = y0[:, i]

            for i, j in enumerate(c1):
                final[:, j] = y1[:, i]

            for i, j in enumerate(c2):
                final[:, j] = y2[:, i]

            # ... same for c3 through c8 ...

            for i, j in enumerate(c9):
                final[:, j] = y9[:, i]

            return final
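As a side note, the per-element loops at the end of forward can be collapsed into single indexed assignments, since PyTorch accepts a Python list as a column index; a small sketch using the same names:

    # equivalent to `for i, j in enumerate(c0): final[:, j] = y0[:, i]`
    final[:, c0] = y0
    final[:, c1] = y1
    # ... and so on for the remaining clusters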

and this is part of the training loop:

    base_model = 'base_resnet152/pretrainedmodels/resnet18-5c106cde.pth'
    model = net.ResNet(net.BasicBlock, [2, 2, 2, 2], num_classes=1000).cuda() if params.cuda else net.ResNet(net.BasicBlock, [2, 2, 2, 2], num_classes=1000)
    model.load_state_dict(torch.load(base_model))

    # second pretrained copy, used only as a weight source
    other_model = net.ResNet(net.BasicBlock, [2, 2, 2, 2], num_classes=1000).cuda() if params.cuda else net.ResNet(net.BasicBlock, [2, 2, 2, 2], num_classes=1000)
    other_model.load_state_dict(torch.load(base_model))

    # replace layer3 and copy the pretrained blocks into every branch
    model.layer3 = net.Inception()
    model.layer3.paral_0[0].load_state_dict(other_model.layer3[0].state_dict())
    model.layer3.paral_0[1].load_state_dict(other_model.layer3[1].state_dict())
    # ... same for paral_1 through paral_8 ...
    model.layer3.paral_9[0].load_state_dict(other_model.layer3[0].state_dict())
    model.layer3.paral_9[1].load_state_dict(other_model.layer3[1].state_dict())

    # bypass the original head; Inception1 now owns pooling and the FC layers
    model.avgpool = net.Identity()
    model.fc = net.Identity()
    model.layer4 = net.Inception1()
    model.layer4.paral_0[0].load_state_dict(other_model.layer4[0].state_dict())
    model.layer4.paral_0[1].load_state_dict(other_model.layer4[1].state_dict())
    # ... same for paral_1 through paral_8 ...
    model.layer4.paral_9[0].load_state_dict(other_model.layer4[0].state_dict())
    model.layer4.paral_9[1].load_state_dict(other_model.layer4[1].state_dict())
    # freeze everything first
    for param in model.parameters():
        param.requires_grad = False

    # then unfreeze only the ten cluster classifiers
    for i in range(10):
        fc = getattr(model.layer4, 'fc%d' % i)
        fc.weight.requires_grad = True
        fc.bias.requires_grad = True

    if params.cuda:
        model = model.cuda()

    optimizer = optim.Adam(model.parameters(), lr=params.learning_rate)
    ...
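As a small aside, a common pattern is to hand the optimizer only the parameters that are still trainable; a minimal sketch reusing the names above:

    # optional: give Adam just the unfrozen parameters
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = optim.Adam(trainable_params, lr=params.learning_rate)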

I'm not sure whether the way I managed the cluster outputs and rearranged them into the "final" tensor could be problematic from a computational-graph and backpropagation point of view. It is very strange: when I trained each cluster separately (with one extra class representing the classes not in that particular cluster), unlike this scenario where all 10 clusters are trained in parallel, I got much better accuracy per cluster. However, when I combined them to build a model with 1000 outputs, I got a dramatic accuracy loss. That was basically why I moved toward training the clusters in parallel instead of training them separately and joining them afterwards. Anyway, in the above piece of code, if we trace the path from input to output of each cluster, each one is as big as the original ResNet-18, so why do I get such poor accuracy? I tried different optimizers, learning rates, schedulers, and so on; none of them changed the result.

Could you explain the other approach a bit more?
Before your current architecture you trained each cluster separately, such that you got 10 different models or 10 different outputs?
Was the previous architecture comparable to a “one-vs-all” approach?

The creation of final shouldn't be an issue; if the indexed assignment broke the computation graph, Autograd would throw an error.
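A standalone toy example (not your model) to convince yourself that gradients flow through the indexed assignment into final:

    import torch

    y = torch.randn(2, 3, requires_grad=True)  # stand-in for one cluster's logits
    final = torch.zeros(2, 5)                  # stand-in for the 1000-way buffer
    final[:, [0, 2, 4]] = y                    # scatter into the global positions
    final.sum().backward()
    print(y.grad)                              # all ones: gradients reached y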

Say we have 10 clusters and each cluster has 100 classes in it.
First approach:
I gave each cluster 101 outputs (100 for the classes it contains and 1 for all the others) and trained each model. I got good accuracy per cluster (I also handled the class-imbalance problem of the dataset). However, for the test phase, I removed the extra class and concatenated the outputs of the 10 clusters, which gives 1000 outputs. Then I got almost zero accuracy, so I decided to move to the next approach.
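The test-time combination in that first approach looked roughly like this (a sketch with hypothetical names: `outputs` is a list of the ten 101-way cluster outputs, with the "other" logit assumed to be last):

    # drop the extra "other" logit from each head and concatenate to 1000 classes
    logits_1000 = torch.cat([out_k[:, :100] for out_k in outputs], dim=1)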
Second approach:
Unlike the previous method, I select only 100 classes per cluster (there is no extra "other" class anymore) and train all clusters simultaneously, instead of training each cluster separately and concatenating later. With this, the accuracy sits around 0.01 and does not change. The thing is, each cluster's model is as big as a ResNet: if you trace the path an input takes to reach, say, cluster1's output, it is General + Cluster1 + softmax, and General + Cluster1 has the same size as the ResNet, the one difference being that it is finetuned for cluster1, which has 100 output classes. So I'm thinking something goes wrong in the backpropagation phase.