Errors while fine-tuning a pretrained model

I got the error below when I tried to fine-tune a pretrained model from GitHub on my own dataset:

File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Generator:
	size mismatch for linear.weight: copying a param with shape torch.Size([16384, 128]) from checkpoint, the shape in current model is torch.Size([24576, 17]).
	size mismatch for linear.bias: copying a param with shape torch.Size([16384]) from checkpoint, the shape in current model is torch.Size([24576]).
	size mismatch for linear.u0: copying a param with shape torch.Size([1, 16384]) from checkpoint, the shape in current model is torch.Size([1, 24576]).
	size mismatch for blocks.0.0.conv1.weight: copying a param with shape torch.Size([1024, 1024, 3, 3]) from checkpoint, the shape in current model is torch.Size([1536, 1536, 3, 3]).
	size mismatch for blocks.0.0.conv1.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([1536]).
	size mismatch for blocks.0.0.conv1.u0: copying a param with shape torch.Size([1, 1024]) from checkpoint, the shape in current model is torch.Size([1, 1536]).
	size mismatch for blocks.0.0.conv2.weight: copying a param with shape torch.Size([1024, 1024, 3, 3]) from checkpoint, the shape in current model is torch.Size([1536, 1536, 3, 3]).
	size mismatch for blocks.0.0.conv2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([1536]).
	size mismatch for blocks.0.0.conv2.u0: copying a param with shape torch.Size([1, 1024]) from checkpoint, the shape in current model is torch.Size([1, 1536]).
	size mismatch for blocks.0.0.conv_sc.weight: copying a param with shape torch.Size([1024, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([1536, 1536, 1, 1]).
	size mismatch for blocks.0.0.conv_sc.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([1536]).
	size mismatch for blocks.0.0.conv_sc.u0: copying a param with shape torch.Size([1, 1024]) from checkpoint, the shape in current model is torch.Size([1, 1536]).
	size mismatch for blocks.0.0.bn1.stored_mean: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([1536]).
	size mismatch for blocks.0.0.bn1.stored_var: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([1536]).
	size mismatch for blocks.0.0.bn2.stored_mean: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([1536]).
	size mismatch for blocks.0.0.bn2.stored_var: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([1536]).
	size mismatch for blocks.1.0.conv1.weight: copying a param with shape torch.Size([512, 1024, 3, 3]) from checkpoint, the shape in current model is torch.Size([768, 1536, 3, 3]).
	size mismatch for blocks.1.0.conv1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.1.0.conv1.u0: copying a param with shape torch.Size([1, 512]) from checkpoint, the shape in current model is torch.Size([1, 768]).
	size mismatch for blocks.1.0.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([768, 768, 3, 3]).
	size mismatch for blocks.1.0.conv2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.1.0.conv2.u0: copying a param with shape torch.Size([1, 512]) from checkpoint, the shape in current model is torch.Size([1, 768]).
	size mismatch for blocks.1.0.conv_sc.weight: copying a param with shape torch.Size([512, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([768, 1536, 1, 1]).
	size mismatch for blocks.1.0.conv_sc.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.1.0.conv_sc.u0: copying a param with shape torch.Size([1, 512]) from checkpoint, the shape in current model is torch.Size([1, 768]).
	size mismatch for blocks.1.0.bn1.stored_mean: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([1536]).
	size mismatch for blocks.1.0.bn1.stored_var: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([1536]).
	size mismatch for blocks.1.0.bn2.stored_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.1.0.bn2.stored_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.2.0.conv1.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([768, 768, 3, 3]).
	size mismatch for blocks.2.0.conv1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.2.0.conv1.u0: copying a param with shape torch.Size([1, 512]) from checkpoint, the shape in current model is torch.Size([1, 768]).
	size mismatch for blocks.2.0.conv2.weight: copying a param with shape torch.Size([512, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([768, 768, 3, 3]).
	size mismatch for blocks.2.0.conv2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.2.0.conv2.u0: copying a param with shape torch.Size([1, 512]) from checkpoint, the shape in current model is torch.Size([1, 768]).
	size mismatch for blocks.2.0.conv_sc.weight: copying a param with shape torch.Size([512, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([768, 768, 1, 1]).
	size mismatch for blocks.2.0.conv_sc.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.2.0.conv_sc.u0: copying a param with shape torch.Size([1, 512]) from checkpoint, the shape in current model is torch.Size([1, 768]).
	size mismatch for blocks.2.0.bn1.stored_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.2.0.bn1.stored_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.2.0.bn2.stored_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.2.0.bn2.stored_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.3.0.conv1.weight: copying a param with shape torch.Size([256, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([384, 768, 3, 3]).
	size mismatch for blocks.3.0.conv1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
	size mismatch for blocks.3.0.conv1.u0: copying a param with shape torch.Size([1, 256]) from checkpoint, the shape in current model is torch.Size([1, 384]).
	size mismatch for blocks.3.0.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([384, 384, 3, 3]).
	size mismatch for blocks.3.0.conv2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
	size mismatch for blocks.3.0.conv2.u0: copying a param with shape torch.Size([1, 256]) from checkpoint, the shape in current model is torch.Size([1, 384]).
	size mismatch for blocks.3.0.conv_sc.weight: copying a param with shape torch.Size([256, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([384, 768, 1, 1]).
	size mismatch for blocks.3.0.conv_sc.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
	size mismatch for blocks.3.0.conv_sc.u0: copying a param with shape torch.Size([1, 256]) from checkpoint, the shape in current model is torch.Size([1, 384]).
	size mismatch for blocks.3.0.bn1.stored_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.3.0.bn1.stored_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([768]).
	size mismatch for blocks.3.0.bn2.stored_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
	size mismatch for blocks.3.0.bn2.stored_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
	size mismatch for blocks.4.0.conv1.weight: copying a param with shape torch.Size([128, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([192, 384, 3, 3]).
	size mismatch for blocks.4.0.conv1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
	size mismatch for blocks.4.0.conv1.u0: copying a param with shape torch.Size([1, 128]) from checkpoint, the shape in current model is torch.Size([1, 192]).
	size mismatch for blocks.4.0.conv2.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([192, 192, 3, 3]).
	size mismatch for blocks.4.0.conv2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
	size mismatch for blocks.4.0.conv2.u0: copying a param with shape torch.Size([1, 128]) from checkpoint, the shape in current model is torch.Size([1, 192]).
	size mismatch for blocks.4.0.conv_sc.weight: copying a param with shape torch.Size([128, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([192, 384, 1, 1]).
	size mismatch for blocks.4.0.conv_sc.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
	size mismatch for blocks.4.0.conv_sc.u0: copying a param with shape torch.Size([1, 128]) from checkpoint, the shape in current model is torch.Size([1, 192]).
	size mismatch for blocks.4.0.bn1.stored_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
	size mismatch for blocks.4.0.bn1.stored_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([384]).
	size mismatch for blocks.4.0.bn2.stored_mean: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
	size mismatch for blocks.4.0.bn2.stored_var: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
	size mismatch for blocks.5.0.conv1.weight: copying a param with shape torch.Size([64, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([96, 192, 3, 3]).
	size mismatch for blocks.5.0.conv1.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
	size mismatch for blocks.5.0.conv1.u0: copying a param with shape torch.Size([1, 64]) from checkpoint, the shape in current model is torch.Size([1, 96]).
	size mismatch for blocks.5.0.conv2.weight: copying a param with shape torch.Size([64, 64, 3, 3]) from checkpoint, the shape in current model is torch.Size([96, 96, 3, 3]).
	size mismatch for blocks.5.0.conv2.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
	size mismatch for blocks.5.0.conv2.u0: copying a param with shape torch.Size([1, 64]) from checkpoint, the shape in current model is torch.Size([1, 96]).
	size mismatch for blocks.5.0.conv_sc.weight: copying a param with shape torch.Size([64, 128, 1, 1]) from checkpoint, the shape in current model is torch.Size([96, 192, 1, 1]).
	size mismatch for blocks.5.0.conv_sc.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
	size mismatch for blocks.5.0.conv_sc.u0: copying a param with shape torch.Size([1, 64]) from checkpoint, the shape in current model is torch.Size([1, 96]).
	size mismatch for blocks.5.0.bn1.stored_mean: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
	size mismatch for blocks.5.0.bn1.stored_var: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([192]).
	size mismatch for blocks.5.0.bn2.stored_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
	size mismatch for blocks.5.0.bn2.stored_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
	size mismatch for output_layer.0.gain: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
	size mismatch for output_layer.0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
	size mismatch for output_layer.0.stored_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
	size mismatch for output_layer.0.stored_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([96]).
	size mismatch for output_layer.2.weight: copying a param with shape torch.Size([3, 64, 3, 3]) from checkpoint, the shape in current model is torch.Size([3, 96, 3, 3]).

Can you show your code? And are you sure you are loading the correct model? This error is telling you that you aren't.

Are the dimensions of the network you are creating the same as those of the network whose weights you are loading?
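For a quick check, you can diff the parameter shapes directly instead of reading the whole traceback; note that the very first mismatch in your log already shows that both the layer width (16384 vs. 24576) and the size of the z input to the first linear layer (128 vs. 17) differ. A minimal sketch (the checkpoint path is a placeholder for whichever .pth file you are resuming from):

import torch

def diff_shapes(model, ckpt_path):
    # Print every parameter whose shape differs between checkpoint and model.
    ckpt = torch.load(ckpt_path, map_location='cpu')
    model_sd = model.state_dict()
    for key, tensor in ckpt.items():
        if key not in model_sd:
            print(key, 'is in the checkpoint but not in the model')
        elif tensor.shape != model_sd[key].shape:
            print(key, ':', tuple(tensor.shape), '!=', tuple(model_sd[key].shape))

# e.g. diff_shapes(G, 'pretrained_model/G_ep_82.pth'), with G your instantiated generator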

I’m fine tuning the unetgan model which is a combination between bigGAN and Unet:
https://github.com/boschresearch/unetgan

And I am loading the correct model.

Yes, the data have the same dimensions.

That's odd, because the error shows that the dimensions of the network you created differ from those of the network you are loading. If you didn't change the architecture or hard-code anything related to the network, you should at least be able to load the weights.
Would you please try their own load_weights function and see if you can load the weights with it? You can find it in utils.py.
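As an aside: if you had deliberately changed the architecture, a common workaround -- not something the repo's utils.py provides -- is to transfer only the tensors whose shapes still match and leave the rest freshly initialized. A minimal sketch, assuming the checkpoint is a plain state dict:

import torch

def load_matching(model, ckpt_path):
    # Keep only checkpoint entries whose name and shape match the model.
    ckpt = torch.load(ckpt_path, map_location='cpu')
    model_sd = model.state_dict()
    matched = {k: v for k, v in ckpt.items()
               if k in model_sd and v.shape == model_sd[k].shape}
    # strict=False tolerates the keys we filtered out.
    model.load_state_dict(matched, strict=False)
    print('loaded', len(matched), 'of', len(ckpt), 'tensors')

But with mismatches in every block, as in your log, almost nothing would transfer, so fixing the configuration is the better route here.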

I think I don't have the same dimensions. How can I verify that?

No, I cannot load the weights.

Can you please share the code for the part where you load the weights and instantiate the model?

This is the load_weights function that is used (from the repo's utils.py; join_strings is defined in the same file):

import torch

def load_weights(G, D, state_dict, weights_root, experiment_name, config, epoch_id='',
                 name_suffix=None, G_ema=None, strict=True, load_optim=True):
  # Weights are read from the directory given by the resume_from config entry.
  root = config["resume_from"]
  if name_suffix:
    print('Loading %s weights from %s...' % (name_suffix, root))
  else:
    print('Loading weights from %s...' % root)
  print("epoch id : ", epoch_id)
  # Generator weights, plus optimizer state if requested.
  if G is not None:
    G.load_state_dict(
      torch.load('%s/%s.pth' % (root, join_strings('_', ['G', epoch_id, name_suffix]))),
      strict=strict)
    if load_optim:
      G.optim.load_state_dict(
        torch.load('%s/%s.pth' % (root, join_strings('_', ['G_optim', epoch_id, name_suffix]))))
  # Discriminator weights, plus optimizer state if requested.
  if D is not None:
    D.load_state_dict(
      torch.load('%s/%s.pth' % (root, join_strings('_', ['D', epoch_id, name_suffix]))),
      strict=strict)
    if load_optim:
      D.optim.load_state_dict(
        torch.load('%s/%s.pth' % (root, join_strings('_', ['D_optim', epoch_id, name_suffix]))))
  # Restore the training bookkeeping (iteration counter, epoch, etc.),
  # creating any entries the saved file does not contain.
  saved = torch.load('%s/%s.pth' % (root, join_strings('_', ['state_dict', epoch_id, name_suffix])))
  for item in state_dict:
    if item in saved:
      state_dict[item] = saved[item]
    else:
      print(item, " not in state_dict, creating it ...")
      state_dict[item] = []
  # Exponential-moving-average copy of the generator, if one is used.
  if G_ema is not None:
    G_ema.load_state_dict(
      torch.load('%s/%s.pth' % (root, join_strings('_', ['G_ema', epoch_id, name_suffix]))),
      strict=strict)

Yes, I saw this in the GitHub repo you shared. I meant: how do you create your model and call this function when you try to load the weights?

After defining the dataloader in the train.py file, I ran this command:

!python train.py \
--dataset mydata   \
--which_best FID \
--batch_size 16 --num_G_accumulations 1 --num_D_accumulations 1 \
--num_D_steps 1 --G_lr 5e-5 --D_lr 2e-4 --D_B2 0.999 --G_B2 0.999 \
--G_attn 0 --D_attn 0 \
--SN_eps 1e-6 --BN_eps 1e-5 --adam_eps 1e-6 \
--G_ortho 0.0 \
--seed 99 \
--G_init ortho --D_init ortho \
--G_eval_mode \
--G_ch 64 --D_ch 64 \
--hier --dim_z 128 \
--ema --use_ema --ema_start 21000 \
--accumulate_stats --num_standing_accumulations 100  \
--test_every 10000 --save_every 10000 --num_best_copies 2 --num_save_copies 1 --seed 0 \
--sample_every 4000   \
--id mydata_unet_bce_noatt_cutmix_consist --gpus "0,1"  \
--unconditional --warmup_epochs 20 \
--unet_mixup --consistency_loss_and_augmentation \
--base_root /content/drive/MyDrive/unetgan/results \
--data_folder /content/drive/MyDrive/mydata \
--resume_from /content/drive/MyDrive/unetgan/pretrained_model  --resume --epoch_id ep_82

First of all, you don't have to pass a flag when you are just using its default value.
I hope you have solved this by now, but I suggest first loading the pretrained weights for a dataset they trained on, rather than your own, and seeing if that works. If it does, the problem is probably the first issue I mentioned. Usually, the dimensions of the feature maps are derived from the size of the input, to avoid hard-coding them for a particular input size, but I think the output of your dataset has different dimensions from the one used in pretraining. You can check by creating an object of your dataset class and printing the shape of a sample, as in the sketch below.
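A minimal sketch of that check, using torchvision's ImageFolder as a stand-in for however train.py actually builds your dataset (the folder path is taken from your command; everything else here is illustrative):

import torchvision.transforms as T
from torchvision.datasets import ImageFolder

ds = ImageFolder('/content/drive/MyDrive/mydata', transform=T.ToTensor())
img, _ = ds[0]
print(img.shape)  # e.g. torch.Size([3, H, W]); do the same for the dataset they pretrained on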

It works for one of their datasets. So could I replace the contents of that dataset's folders with my own data and fine-tune the model using the pretrained weights?

No, most probably it won't work that way. Just make the output of your dataset have the same shape and formatting as the output of that dataset.
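For example, if the pretraining run used 3-channel images at some resolution R, scaled to [-1, 1] (the usual GAN preprocessing; check their dataset code for the exact pipeline, since this one is an assumption), a transform along these lines would force your samples into the same format:

import torchvision.transforms as T

R = 128  # assumption: set this to the resolution the pretrained weights use
transform = T.Compose([
    T.Resize(R),
    T.CenterCrop(R),
    T.ToTensor(),                       # float tensor in [0, 1], shape [3, R, R]
    T.Normalize([0.5] * 3, [0.5] * 3),  # rescale to [-1, 1]
])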