How to load a pretrained model that uses Sequential?

I have trained a model that uses Sequential. I now want to load that model to use as pretrained weights in a new model. Both the trained model and the model I am about to train have the same model definition. This is how I save the model:

torch.save(model.state_dict(), 'my_model')

Standard loading does not work:

weight_file = 'weights/my_model.pth'
weights = torch.load(weight_file)
self.load_state_dict(weights)

This gives the error:

Unexpected key(s) in state_dict
Missing key(s) in state_dict

I know that the popular solution to state_dict issues is to do this:

weight_file = 'weights/my_model.pth'
weights = torch.load(weight_file)
state_dict = weights['state_dict']
new_state_dict = OrderedDict()
for k, v in state_dict.items():
  name = k[7:]  # remove `module.`
  new_state_dict[name] = v
self.load_state_dict(new_state_dict)

This gives the error:

KeyError: 'state_dict'

Does anyone have an example of how I can load my model correctly?

nn.Sequential models should not cause any issues in saving and loading state_dicts. Could you post a minimal and executable code snippet reproducing the issue?

A minimal example might be difficult to put together. However, this is the model that I am using:

https://github.com/kytimmylai/DFUC2022/blob/main/lib/kingnet.py

Note that in that code, ImageNet-1K weights are loaded. The weights are loaded without error. However, when I change the weights to a model I have trained using the same model definition, I get the errors I posted in the OP.

Could you post the model creation causing the issue?

Do you mean the model definition? If so, is this sufficient?

(backbone): KingNet(
    (base): ModuleList(
      (0): ConvLayer(
        (conv): Conv2d(4, 30, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (ibn): IBN(
          (IN): InstanceNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
          (BN): BatchNorm2d(15, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (prelu): PReLU(num_parameters=1)
      )
      (1): ConvLayer(
        (conv): Conv2d(30, 60, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (sn): SwitchNorm2d()
        (prelu): PReLU(num_parameters=1)
      )
      (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (3): KingBlock(
        (layers): ModuleList(
          (0-9): 10 x ConvLayer(
            (conv): Conv2d(30, 30, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (10): ConvLayer(
            (conv): Conv2d(60, 60, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(60, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
        )
      )
      (4): SELayer(
        (avg_pool): AdaptiveAvgPool2d(output_size=1)
        (fc): Sequential(
          (0): Linear(in_features=150, out_features=7, bias=False)
          (1): ReLU(inplace=True)
          (2): Linear(in_features=7, out_features=150, bias=False)
          (3): Sigmoid()
        )
      )
      (5): ConvLayer(
        (conv): Conv2d(150, 120, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (prelu): PReLU(num_parameters=1)
      )
      (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (7): KingBlock(
        (layers): ModuleList(
          (0-9): 10 x ConvLayer(
            (conv): Conv2d(60, 60, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(60, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (10): ConvLayer(
            (conv): Conv2d(120, 120, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
        )
      )
      (8): SELayer(
        (avg_pool): AdaptiveAvgPool2d(output_size=1)
        (fc): Sequential(
          (0): Linear(in_features=300, out_features=15, bias=False)
          (1): ReLU(inplace=True)
          (2): Linear(in_features=15, out_features=300, bias=False)
          (3): Sigmoid()
        )
      )
      (9): ConvLayer(
        (conv): Conv2d(300, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm): BatchNorm2d(240, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (prelu): PReLU(num_parameters=1)
      )
      (10): KingBlock(
        (layers): ModuleList(
          (0): ConvLayer(
            (conv): Conv2d(40, 40, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (1-2): 2 x ConvLayer(
            (conv): Conv2d(80, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(80, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (3): ConvLayer(
            (conv): Conv2d(120, 120, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (4): ConvLayer(
            (conv): Conv2d(40, 40, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (5): ConvLayer(
            (conv): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (6): ConvLayer(
            (conv): Conv2d(40, 40, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (7): ConvLayer(
            (conv): Conv2d(120, 120, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(120, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (8-9): 2 x ConvLayer(
            (conv): Conv2d(80, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(80, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (10): ConvLayer(
            (conv): Conv2d(40, 40, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(40, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (11): ConvLayer(
            (conv): Conv2d(240, 240, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(240, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
        )
      )
      (11): SELayer(
        (avg_pool): AdaptiveAvgPool2d(output_size=1)
        (fc): Sequential(
          (0): Linear(in_features=560, out_features=28, bias=False)
          (1): ReLU(inplace=True)
          (2): Linear(in_features=28, out_features=560, bias=False)
          (3): Sigmoid()
        )
      )
      (12): ConvLayer(
        (conv): Conv2d(560, 540, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm): BatchNorm2d(540, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (prelu): PReLU(num_parameters=1)
      )
      (13): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (14): KingBlock(
        (layers): ModuleList(
          (0-9): 10 x ConvLayer(
            (conv): Conv2d(270, 270, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(270, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (10): ConvLayer(
            (conv): Conv2d(540, 540, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(540, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
        )
      )
      (15): SELayer(
        (avg_pool): AdaptiveAvgPool2d(output_size=1)
        (fc): Sequential(
          (0): Linear(in_features=1350, out_features=67, bias=False)
          (1): ReLU(inplace=True)
          (2): Linear(in_features=67, out_features=1350, bias=False)
          (3): Sigmoid()
        )
      )
      (16): ConvLayer(
        (conv): Conv2d(1350, 800, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm): BatchNorm2d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (prelu): PReLU(num_parameters=1)
      )
      (17): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (18): KingBlock(
        (layers): ModuleList(
          (0-1): 2 x ConvLayer(
            (conv): Conv2d(400, 400, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(400, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
          (2): ConvLayer(
            (conv): Conv2d(800, 800, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (norm): BatchNorm2d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (prelu): PReLU(num_parameters=1)
          )
        )
      )
      (19): SELayer(
        (avg_pool): AdaptiveAvgPool2d(output_size=1)
        (fc): Sequential(
          (0): Linear(in_features=1200, out_features=60, bias=False)
          (1): ReLU(inplace=True)
          (2): Linear(in_features=60, out_features=1200, bias=False)
          (3): Sigmoid()
        )
      )
      (20): ConvLayer(
        (conv): Conv2d(1200, 1200, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm): BatchNorm2d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (prelu): PReLU(num_parameters=1)
      )
      (21): Sequential(
        (0): AdaptiveAvgPool2d(output_size=(1, 1))
        (1): Flatten()
        (2): Dropout(p=0.2, inplace=False)
        (3): Linear(in_features=1200, out_features=1000, bias=True)
      )
    )
  )
  (head): LawinHead5(
    (lawin_8): LawinAttn(
      (g): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (conv_out): ConvModule(
        (conv): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (theta): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (phi): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (position_mixing): ModuleList(
        (0-63): 64 x Linear(in_features=64, out_features=64, bias=True)
      )
    )
    (lawin_4): LawinAttn(
      (g): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (conv_out): ConvModule(
        (conv): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (theta): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (phi): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (position_mixing): ModuleList(
        (0-15): 16 x Linear(in_features=64, out_features=64, bias=True)
      )
    )
    (lawin_2): LawinAttn(
      (g): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (conv_out): ConvModule(
        (conv): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (theta): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (phi): ConvModule(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
      )
      (position_mixing): ModuleList(
        (0-3): 4 x Linear(in_features=64, out_features=64, bias=True)
      )
    )
    (image_pool): Sequential(
      (0): AdaptiveAvgPool2d(output_size=1)
      (1): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
    )
    (linear_c4): MLP(
      (proj): Linear(in_features=1200, out_features=768, bias=True)
    )
    (linear_c3): MLP(
      (proj): Linear(in_features=800, out_features=768, bias=True)
    )
    (linear_c2): MLP(
      (proj): Linear(in_features=540, out_features=768, bias=True)
    )
    (linear_c1): MLP(
      (proj): Linear(in_features=150, out_features=48, bias=True)
    )
    (linear_fuse): ConvModule(
      (conv): Conv2d(2304, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (activate): ReLU(inplace=True)
    )
    (short_path): ConvModule(
      (conv): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (activate): ReLU(inplace=True)
    )
    (cat): ConvModule(
      (conv): Conv2d(2560, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (activate): ReLU(inplace=True)
    )
    (low_level_fuse): ConvModule(
      (conv): Conv2d(560, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (activate): ReLU(inplace=True)
    )
    (ds_8): PatchEmbed(
      (proj): Conv2d(512, 512, kernel_size=(8, 8), stride=(8, 8))
      (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (ds_4): PatchEmbed(
      (proj): Conv2d(512, 512, kernel_size=(4, 4), stride=(4, 4))
      (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (ds_2): PatchEmbed(
      (proj): Conv2d(512, 512, kernel_size=(2, 2), stride=(2, 2))
      (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (class_seg): Conv2d(512, 1, kernel_size=(1, 1), stride=(1, 1))
  )
  (last3_seg): Conv2d(512, 1, kernel_size=(1, 1), stride=(1, 1))
  (last3_seg2): Conv2d(768, 1, kernel_size=(1, 1), stride=(1, 1))
)

That model definition is for my adjusted version of the original model. I used it to pretrain the weights, and I am now training again with the same model definition, starting from those pretrained weights.

Here is a dump of the full error when doing a simple load:

weights = torch.load(weight_file)
self.load_state_dict(weights)

And for extra info, I am using torch 2.3.0.

Based on the error message, you are adding a backbone module, which makes the keys incompatible. If you get stuck, feel free to post a minimal and executable code snippet.
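One way to see this kind of mismatch is to compare the two sets of key names directly. The following is only a sketch on plain lists of strings: diff_keys is a hypothetical helper, and the key names are illustrative ones taken from the printed model, not from the actual checkpoint. In practice you would pass model.state_dict().keys() and the keys of the dict returned by torch.load.

```python
def diff_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, mirroring the two
    groups that load_state_dict reports in its error message."""
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # model expects, file lacks
    unexpected = sorted(checkpoint_keys - model_keys)   # file has, model does not
    return missing, unexpected


# Illustrative key names only (assumed, not read from a real checkpoint).
model_keys = ["base.0.conv.weight", "base.1.conv.weight"]
checkpoint_keys = [
    "backbone.base.0.conv.weight",
    "backbone.base.1.conv.weight",
    "head.class_seg.weight",
]

missing, unexpected = diff_keys(model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

If every unexpected key is a missing key with an extra leading prefix, the two models differ only in how the module is nested.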

Is the problem because the pretrained weights were trained with a backbone, or because the new model I am trying to train (using the pretrained weights) has a backbone? I’ve never created my own pretrained weights before so there may be a fundamental gap in my knowledge here.

The state_dict contains parameters whose keys include backbone, while the model's keys do not. Make sure the same attribute names are used in both models (the one that saves the state_dict and the one that loads it) to avoid key mismatches.
Alternatively, if you really want to change attribute names and can map them directly, you could also manipulate the keys in the state_dict, e.g. by replacing backbone with base, but you would need to make sure the renamed keys really correspond to each other.
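A minimal sketch of such a key remapping, assuming the only difference is a leading prefix. rename_prefix is a hypothetical helper, the placeholder strings stand in for tensors, and the key names are illustrative. It works on any mapping, so the same function could be applied to the dict returned by torch.load before calling load_state_dict.

```python
from collections import OrderedDict


def rename_prefix(state_dict, old_prefix, new_prefix=""):
    """Return a copy of state_dict in which keys starting with old_prefix
    start with new_prefix instead; all other keys are kept unchanged."""
    renamed = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        renamed[key] = value
    return renamed


# Placeholder values stand in for tensors; key names are assumed.
checkpoint = OrderedDict([
    ("backbone.base.0.conv.weight", "tensor-0"),
    ("head.class_seg.weight", "tensor-1"),
])

stripped = rename_prefix(checkpoint, "backbone.")
print(list(stripped))
```

Including the trailing dot in the prefix keeps the match anchored to a whole module name, so a key such as backbone2.conv.weight would be left alone. Two keys that collapse to the same name after renaming would silently overwrite each other, so it is worth checking that the number of keys is unchanged.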

As Patrick mentioned, the names in your two definitions differ (there is an extra backbone). You can either remove that name, or define the backbone outside of your model, load the weights into it, and then pass it to the bigger model as an argument. I think the second approach is the better one.

Also, here is a snippet that removes the extra backbone from the key names. I have not tested it, so I'm not sure it works correctly:

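w = {}
for name, par in weights.items():  # weights = torch.load(weight_file)
    if name.startswith('backbone.'):
        # strip the leading 'backbone.' so the key matches the model's own names;
        # keys without that prefix (e.g. the head) are skipped
        w[name[len('backbone.'):]] = par

self.load_state_dict(w)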