How to load a pretrained model that uses Sequential?

wmd · April 30, 2024, 12:30am

I have trained a model that uses Sequential. I now want to load that model to use as pretrained weights in a new model. Both the trained model and the model I am about to train have the same model definition. This is how I save the model:

torch.save(model.state_dict(), 'my_model')

Standard loading does not work:

weight_file = 'weights/my_model.pth'
weights = torch.load(weight_file)
self.load_state_dict(weights)

This gives the error:

Unexpected key(s) in state_dict
Missing key(s) in state_dict

I know that the popular solution to state_dict issues is to do this:

from collections import OrderedDict

weight_file = 'weights/my_model.pth'
weights = torch.load(weight_file)
state_dict = weights['state_dict']
new_state_dict = OrderedDict()
for k, v in state_dict.items():
  name = k[7:] # remove the 7-character `module.` prefix
  new_state_dict[name] = v
# load once, after all keys have been renamed
self.load_state_dict(new_state_dict)

This gives the error:

KeyError: 'state_dict'

Does anyone have an example of how I can load my model correctly?

ptrblck · April 30, 2024, 1:31am

nn.Sequential models should not cause any issues in saving and loading state_dicts. Could you post a minimal and executable code snippet reproducing the issue?
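
For reference, a round trip of the kind described might look like the sketch below. The toy layer sizes and the in-memory buffer are assumptions made for illustration and are not taken from the thread. It also shows that torch.save(model.state_dict(), ...) stores the dictionary of parameters itself, so the loaded object has no 'state_dict' entry to index into.

```python
import io

import torch
import torch.nn as nn


def make_model():
    # Toy stand-in; layer sizes are arbitrary and not taken from the thread.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))


src = make_model()

# Use an in-memory buffer so the sketch leaves no file on disk.
buffer = io.BytesIO()
torch.save(src.state_dict(), buffer)
buffer.seek(0)

# The loaded object is the state_dict itself: parameter names map to tensors.
weights = torch.load(buffer)
print(list(weights.keys()))

# A freshly built model with the same definition accepts the keys as they are.
dst = make_model()
result = dst.load_state_dict(weights)  # strict=True by default
print(result.missing_keys, result.unexpected_keys)
```

If the keys printed here differ from those of the model being loaded into, the attribute names of the two models differ somewhere.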

wmd · April 30, 2024, 1:34am

A minimal example might be difficult to do. However, this is the model that I am using:

https://github.com/kytimmylai/DFUC2022/blob/main/lib/kingnet.py

wmd · April 30, 2024, 1:41am

Note that in that code, ImageNet-1K weights are loaded. The weights are loaded without error. However, when I change the weights to a model I have trained using the same model definition, I get the errors I posted in the OP.

ptrblck · April 30, 2024, 2:03pm

Could you post the model creation causing the issue?

wmd · April 30, 2024, 2:06pm

Do you mean the model definition? If so, is this sufficient?

(backbone): KingNet(
    (base): ModuleList(
      (0): ConvLayer(
        (conv): Conv2d(4, 30, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (ibn): IBN(
          (IN): InstanceNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
          (BN): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (prelu): PReLU(num_parameters=1)
      )
      (1): ConvLayer(
        (conv): Conv2d(30, 60, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (sn): SwitchNorm2d()
        (prelu): PReLU(num_parameters=1)
      )
      (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (3): KingBlock(
        (layers): ModuleList(
          (0-9): 10 x ConvLayer(
            (conv): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (10): ConvLayer(
            (conv): Conv2d(60, 60, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(60, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
        )
      )
      (4): SELayer(
        (avg_pool): AdaptiveAvgPool2d(output_size=1)
        (fc): Sequential(
          (0): Linear(in_features=150, out_features=7, bias=False)
          (1): ReLU(inplace=True)
          (2): Linear(in_features=7, out_features=150, bias=False)
          (3): Sigmoid()
        )
      )
      (5): ConvLayer(
        (conv): Conv2d(150, 120, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (prelu): PReLU(num_parameters=1)
      )
      (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (7): KingBlock(
        (layers): ModuleList(
          (0-9): 10 x ConvLayer(
            (conv): Conv2d(60, 60, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(60, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (10): ConvLayer(
            (conv): Conv2d(120, 120, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
        )
      )
      (8): SELayer(
        (avg_pool): AdaptiveAvgPool2d(output_size=1)
        (fc): Sequential(
          (0): Linear(in_features=300, out_features=15, bias=False)
          (1): ReLU(inplace=True)
          (2): Linear(in_features=15, out_features=300, bias=False)
          (3): Sigmoid()
        )
      )
      (9): ConvLayer(
        (conv): Conv2d(300, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm): BatchNorm2d(240, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (prelu): PReLU(num_parameters=1)
      )
      (10): KingBlock(
        (layers): ModuleList(
          (0): ConvLayer(
            (conv): Conv2d(40, 40, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (1-2): 2 x ConvLayer(
            (conv): Conv2d(80, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(80, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (3): ConvLayer(
            (conv): Conv2d(120, 120, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (4): ConvLayer(
            (conv): Conv2d(40, 40, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (5): ConvLayer(
            (conv): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (6): ConvLayer(
            (conv): Conv2d(40, 40, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (7): ConvLayer(
            (conv): Conv2d(120, 120, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (8-9): 2 x ConvLayer(
            (conv): Conv2d(80, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(80, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (10): ConvLayer(
            (conv): Conv2d(40, 40, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (11): ConvLayer(
            (conv): Conv2d(240, 240, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(240, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
        )
      )
      (11): SELayer(
        (avg_pool): AdaptiveAvgPool2d(output_size=1)
        (fc): Sequential(
          (0): Linear(in_features=560, out_features=28, bias=False)
          (1): ReLU(inplace=True)
          (2): Linear(in_features=28, out_features=560, bias=False)
          (3): Sigmoid()
        )
      )
      (12): ConvLayer(
        (conv): Conv2d(560, 540, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm): BatchNorm2d(540, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (prelu): PReLU(num_parameters=1)
      )
      (13): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (14): KingBlock(
        (layers): ModuleList(
          (0-9): 10 x ConvLayer(
            (conv): Conv2d(270, 270, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(270, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (10): ConvLayer(
            (conv): Conv2d(540, 540, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(540, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
        )
      )
      (15): SELayer(
        (avg_pool): AdaptiveAvgPool2d(output_size=1)
        (fc): Sequential(
          (0): Linear(in_features=1350, out_features=67, bias=False)
          (1): ReLU(inplace=True)
          (2): Linear(in_features=67, out_features=1350, bias=False)
          (3): Sigmoid()
        )
      )
      (16): ConvLayer(
        (conv): Conv2d(1350, 800, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm): BatchNorm2d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (prelu): PReLU(num_parameters=1)
      )
      (17): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (18): KingBlock(
        (layers): ModuleList(
          (0-1): 2 x ConvLayer(
            (conv): Conv2d(400, 400, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(400, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (2): ConvLayer(
            (conv): Conv2d(800, 800, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
        )
      )
      (19): SELayer(
        (avg_pool): AdaptiveAvgPool2d(output_size=1)
        (fc): Sequential(
          (0): Linear(in_features=1200, out_features=60, bias=False)
          (1): ReLU(inplace=True)
          (2): Linear(in_features=60, out_features=1200, bias=False)
          (3): Sigmoid()
        )
      )
      (20): ConvLayer(
        (conv): Conv2d(1200, 1200, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm): BatchNorm2d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (prelu): PReLU(num_parameters=1)
      )
      (21): Sequential(
        (0): AdaptiveAvgPool2d(output_size=(1, 1))
        (1): Flatten()
        (2): Dropout(p=0.2, inplace=False)
        (3): Linear(in_features=1200, out_features=1000, bias=True)
      )
    )
  )
  (head): LawinHead5(
    (lawin_8): LawinAttn(
      (g): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (conv_out): ConvModule(
        (conv): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (theta): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (phi): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (position_mixing): ModuleList(
        (0-63): 64 x Linear(in_features=64, out_features=64, bias=True)
      )
    )
    (lawin_4): LawinAttn(
      (g): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (conv_out): ConvModule(
        (conv): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (theta): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (phi): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (position_mixing): ModuleList(
        (0-15): 16 x Linear(in_features=64, out_features=64, bias=True)
      )
    )
    (lawin_2): LawinAttn(
      (g): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (conv_out): ConvModule(
        (conv): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (theta): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (phi): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (position_mixing): ModuleList(
        (0-3): 4 x Linear(in_features=64, out_features=64, bias=True)
      )
    )
    (image_pool): Sequential(
      (0): AdaptiveAvgPool2d(output_size=1)
      (1): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
    )
    (linear_c4): MLP(
      (proj): Linear(in_features=1200, out_features=768, bias=True)
    )
    (linear_c3): MLP(
      (proj): Linear(in_features=800, out_features=768, bias=True)
    )
    (linear_c2): MLP(
      (proj): Linear(in_features=540, out_features=768, bias=True)
    )
    (linear_c1): MLP(
      (proj): Linear(in_features=150, out_features=48, bias=True)
    )
    (linear_fuse): ConvModule(
      (conv): Conv2d(2304, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (activate): ReLU(inplace=True)
    )
    (short_path): ConvModule(
      (conv): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (activate): ReLU(inplace=True)
    )
    (cat): ConvModule(
      (conv): Conv2d(2560, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (activate): ReLU(inplace=True)
    )
    (low_level_fuse): ConvModule(
      (conv): Conv2d(560, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (activate): ReLU(inplace=True)
    )
    (ds_8): PatchEmbed(
      (proj): Conv2d(512, 512, kernel_size=(8, 8), stride=(8, 8))
      (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (ds_4): PatchEmbed(
      (proj): Conv2d(512, 512, kernel_size=(4, 4), stride=(4, 4))
      (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (ds_2): PatchEmbed(
      (proj): Conv2d(512, 512, kernel_size=(2, 2), stride=(2, 2))
      (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (class_seg): Conv2d(512, 1, kernel_size=(1, 1), stride=(1, 1))
  )
  (last3_seg): Conv2d(512, 1, kernel_size=(1, 1), stride=(1, 1))
  (last3_seg2): Conv2d(768, 1, kernel_size=(1, 1), stride=(1, 1))
)

wmd · April 30, 2024, 3:16pm

That model definition is for my adjusted version of the original model. That is what I used to pretrain weights with, and is the same model definition that I am using to train with again using my pretrained weights.

wmd · April 30, 2024, 5:12pm

Here is a dump of the full error when doing a simple load:

weights = torch.load(weight_file)
self.load_state_dict(weights)

wmd · April 30, 2024, 5:18pm

And for extra info, I am using torch 2.3.0.

ptrblck · April 30, 2024, 9:09pm

Based on the error message, you are adding a backbone module, which makes the keys incompatible. If you get stuck, feel free to create a minimal and executable code snippet.
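
One way to see this kind of mismatch directly is to compare the two key sets before calling load_state_dict. The sketch below uses toy modules: the Wrapper class and the layer sizes are hypothetical and are not the poster's model. The checkpoint comes from a model that keeps its layers under a backbone attribute, while the target model holds the same layers without that wrapper.

```python
import torch.nn as nn


def make_layers():
    # Arbitrary toy layers, used for both models.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))


class Wrapper(nn.Module):
    # Hypothetical model that stores its layers under a `backbone` attribute.
    def __init__(self):
        super().__init__()
        self.backbone = make_layers()


saved = Wrapper().state_dict()  # keys look like "backbone.0.weight"
target = make_layers()          # keys look like "0.weight"

saved_keys = set(saved.keys())
model_keys = set(target.state_dict().keys())

# Keys present in the checkpoint but not in the model ("Unexpected key(s)").
unexpected = sorted(saved_keys - model_keys)
# Keys present in the model but not in the checkpoint ("Missing key(s)").
missing = sorted(model_keys - saved_keys)

print("unexpected:", unexpected)
print("missing:", missing)
```

When every unexpected key is a missing key with an extra prefix in front, as here, the two models differ only in how the submodule is nested.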

wmd · April 30, 2024, 9:51pm

Is the problem because the pretrained weights were trained with a backbone, or because the new model I am trying to train (using the pretrained weights) has a backbone? I’ve never created my own pretrained weights before so there may be a fundamental gap in my knowledge here.

ptrblck · May 1, 2024, 1:27pm

The state_dict contains parameters with a backbone key while the model does not. Make sure the same attribute names are used in both models (the one to save the state_dict and the other one loading it) to avoid key mismatches.
Alternatively, if you really want to change attribute names and can map them directly, you could also manipulate the keys in the state_dict, e.g. by replacing backbone with base, but you would need to make sure the renamed keys really correspond to the same parameters.
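
A minimal sketch of that kind of key manipulation is shown below. The helper name rename_prefix is hypothetical, and the placeholder strings stand in for tensors so the example runs without a checkpoint. The two key names are only illustrative, modelled on the module names in the printed model above.

```python
from collections import OrderedDict


def rename_prefix(state_dict, old, new):
    """Return a copy of state_dict with a leading `old` prefix replaced by `new`."""
    renamed = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(old):
            key = new + key[len(old):]
        if key in renamed:
            # Two source keys collapsing into one would silently drop weights.
            raise KeyError(f"renaming produced a duplicate key: {key}")
        renamed[key] = value
    return renamed


# Placeholder values stand in for the tensors of a real checkpoint.
checkpoint = OrderedDict([
    ("backbone.base.0.conv.weight", "w0"),
    ("head.class_seg.bias", "b0"),
])

# Strip the leading "backbone." and leave every other key untouched.
stripped = rename_prefix(checkpoint, "backbone.", "")
print(list(stripped))
```

After renaming, it is worth comparing the result against the keys and tensor shapes of model.state_dict() before loading, since a matching name does not guarantee a matching parameter.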

mhnazeri · May 3, 2024, 2:38am

As Patrick mentioned, you have different names in your definition (an extra backbone). You can either remove that name, or define the backbone outside of your model, load the weights into it, and then pass it to the bigger model as an argument (this approach is better than the other one).

Also, here is a snippet that removes the extra backbone prefix from the key names, though I’m not sure whether it works correctly:

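w = {}
for name, par in weights.items():  # `weights` is the dict returned by torch.load
    # keep only backbone entries; other keys (e.g. head) are skipped
    if name.startswith('backbone.'):
        # drop the leading `backbone.` from the key
        w[name[len('backbone.'):]] = par

self.load_state_dict(w)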
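
The other approach mentioned above, building the backbone separately, loading its weights, and then passing it into the bigger model, could be sketched as follows. Backbone and Segmenter are hypothetical toy classes standing in for the real backbone and the full model, and the pretrained dict is generated in place rather than read with torch.load.

```python
import torch
import torch.nn as nn


class Backbone(nn.Module):
    # Hypothetical stand-in for the real backbone.
    def __init__(self):
        super().__init__()
        self.base = nn.ModuleList([nn.Linear(4, 8), nn.Linear(8, 8)])

    def forward(self, x):
        for layer in self.base:
            x = layer(x)
        return x


class Segmenter(nn.Module):
    # Hypothetical bigger model that receives a ready-made backbone.
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(8, 1)

    def forward(self, x):
        return self.head(self.backbone(x))


# Stand-in for torch.load(...) on a checkpoint saved from the full model.
pretrained = Segmenter(Backbone()).state_dict()

# Keep only the backbone entries and drop their "backbone." prefix.
prefix = "backbone."
backbone_weights = {
    key[len(prefix):]: value
    for key, value in pretrained.items()
    if key.startswith(prefix)
}

backbone = Backbone()
backbone.load_state_dict(backbone_weights)  # strict: every key must match

# The head keeps its fresh initialization; only the backbone is pretrained.
model = Segmenter(backbone)
```

Because the backbone is loaded on its own with strict matching, any remaining name mismatch shows up at that step rather than being mixed in with the keys of the head.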