Cannot load state dict even though sizes are the same

I have run this code:

import torch

z = torch.load(load_path)

# print the checkpoint contents, then try to load them into the model
for k, v in z.items():
    print(z)
model.load_state_dict(z)
..........('base_loss', tensor([[0.]], device='cuda:0')),
             ('b', tensor(5., device='cuda:0')),
             ('x_grid',
              tensor([[[[[-1.0000, -1.0000, -1.0000,  ..., -1.0000, -1.0000, -1.0000],
                         [-0.9259, -0.9259, -0.9259,  ..., .....


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-954a267c40b0> in <module>
     22 for k, v in z.items():
     23     print(z)
---> 24 model.load_state_dict(z)

~/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
    837         if len(error_msgs) > 0:
    838             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 839                                self.__class__.__name__, "\n\t".join(error_msgs)))
    840         return _IncompatibleKeys(missing_keys, unexpected_keys)
    841

RuntimeError: Error(s) in loading state_dict for IODINE:
	While copying the parameter named "x_grid", whose dimensions in the model are torch.Size([1, 1, 1, 28, 28]) and whose dimensions in the checkpoint are torch.Size([1, 1, 1, 28, 28]).
	While copying the parameter named "y_grid", whose dimensions in the model are torch.Size([1, 1, 1, 28, 28]) and whose dimensions in the checkpoint are torch.Size([1, 1, 1, 28, 28]).
	While copying the parameter named "decoder.x_grid", whose dimensions in the model are torch.Size([1, 1, 28, 28]) and whose dimensions in the checkpoint are torch.Size([1, 1, 28, 28]).
	While copying the parameter named "decoder.y_grid", whose dimensions in the model are torch.Size([1, 1, 28, 28]) and whose dimensions in the checkpoint are torch.Size([1, 1, 28, 28]).

All of the sizes are the same.
I am trying to run a version of IODINE I found on GitHub, but loading the saved state dict into my model is giving me trouble. I don't think I trained it on multiple GPUs, although the program does have a DataParallel mode, and the checkpoint doesn't match what I believe a DataParallel state dict should look like, based on other Q&As on this site (some of the keys are shorter than the seven-character module. prefix).
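
For reference, a DataParallel checkpoint prefixes every key with module., which is what I was checking for. A minimal sketch of detecting and stripping that prefix (the usual pattern from those Q&As, not code from the IODINE repo):

state = torch.load(load_path)

# a DataParallel state dict prefixes every parameter/buffer key with 'module.'
if any(k.startswith('module.') for k in state):
    state = {k[len('module.'):]: v for k, v in state.items()}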

The model's save function looks like this:

def save(self, save_path, epoch=None):
    print('Saving model at epoch {}'.format(epoch))
    # append '_epoch_N.th' when an epoch is given, otherwise just the model name
    suffix = self.name if epoch is None else self.name + '_epoch_{}.th'.format(epoch)
    model_save_path = save_path + suffix
    torch.save(self.state_dict(), model_save_path)

I think it has something to do with the fact that the grid parameters are created with linspace and meshgrid.
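
I don't have the exact repo code at hand, but the pattern is something like this sketch (the module and buffer names are assumed from the error message). Note that meshgrid returns expanded views, which are not contiguous:

import torch
import torch.nn as nn

class GridHolder(nn.Module):  # hypothetical stand-in for the IODINE module
    def __init__(self, size=28):
        super().__init__()
        coords = torch.linspace(-1, 1, size)
        # meshgrid builds the 2-D grids by expanding the 1-D coordinates,
        # so x and y are non-contiguous views
        x, y = torch.meshgrid(coords, coords)
        # indexing with None adds the leading broadcast dims without copying,
        # so the registered buffers stay non-contiguous
        self.register_buffer('x_grid', x[None, None, None])
        self.register_buffer('y_grid', y[None, None, None])

print(GridHolder().x_grid.is_contiguous())  # False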

I tried loading all of the parameters except the grids, but that just gave me an error saying the grid keys were missing.

I tried something along the lines of this solution: How to load part of pre trained model? Again it gave me the same error as the first, that the grids have the same dimensions but could not be copied.
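
The partial-load pattern I tried was roughly this (a sketch of the approach from that thread, not the exact code I ran):

pretrained_dict = torch.load(load_path)
model_dict = model.state_dict()

# keep everything from the checkpoint except the grid entries
filtered = {k: v for k, v in pretrained_dict.items() if 'grid' not in k}
model_dict.update(filtered)

# the merged dict still contains the model's own (non-contiguous) grids,
# so load_state_dict tries to copy into them and fails the same way
model.load_state_dict(model_dict)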

Could you try to call .contiguous() on all expanded tensors?
I’ve seen a similar issue before and will create an issue in a moment to track it.


It worked! Thanks!!

For people looking at this in the future, this is the code I used:

# rebuild the grid attributes as contiguous copies so that
# load_state_dict can copy the checkpoint values into them
model.x_grid = model.x_grid.contiguous()
model.y_grid = model.y_grid.contiguous()
model.decoder.x_grid = model.decoder.x_grid.contiguous()
model.decoder.y_grid = model.decoder.y_grid.contiguous()

model.load_state_dict(pretrained_dict)

I guess the grid tensors did not have the same layout in memory as the pretrained tensors the model was trying to load, so the copy failed even though the tensors are nominally the same size, which resulted in a cryptic error message.
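
I think the underlying failure can be reproduced directly: copy_() into an expanded (non-contiguous) tensor is not allowed, because several elements of the destination alias the same memory location, and load_state_dict essentially calls copy_() per tensor. A minimal demonstration:

import torch

dst = torch.zeros(1).expand(3)  # non-contiguous: all three elements alias one storage element
src = torch.ones(3)

try:
    dst.copy_(src)  # what load_state_dict does under the hood
except RuntimeError as e:
    print(e)  # "more than one element of the written-to tensor refers to a single memory location..."

# .contiguous() materializes a real copy with its own storage, so this works
dst = dst.contiguous()
dst.copy_(src)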

That's also my guess, and I tried to dig a bit into this issue.
It seems it cannot be reproduced with the latest nightly binary, which is why I haven't created the GitHub issue.


I have this issue when I load a checkpoint to CPU.
It works fine without calling .contiguous() if I load to GPU.

Are you using the latest PyTorch release (stable or nightly) and if so, do you have a code snippet to reproduce it on the nightly?

I’m using 1.8.1
I’ll create a snippet shortly.

@ptrblck Running this with device='cpu' will trigger that error, but it is fine with CUDA (torch 1.8.1):

import torch
from torch import nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        kernel_size = 3
        self.norm_conv = nn.Conv2d(2, 2, kernel_size)
        with torch.no_grad():
            # expand() turns the weight into a non-contiguous view
            self.norm_conv.weight = nn.Parameter(
                torch.ones((1, 1, kernel_size, kernel_size)).expand(2, 2, kernel_size, kernel_size),
                requires_grad=False)

    def forward(self, x):
        return self.norm_conv(x)

device = 'cpu'

model = Net().to(device)
torch.save(model.state_dict(), 'checkpoint.pth')

# load the checkpoint into a fresh instance
model2 = Net()
model2.eval()
param_dict = torch.load('checkpoint.pth', map_location=device)
model2.load_state_dict(param_dict)

Thanks for the code! I think it could be related to this issue, so feel free to comment there with your use case.