RuntimeError: Error(s) in loading state_dict for model

I have a NeuralNetwork model in which I intentionally registered two extra variables. When I save the state_dict and then load the model back for evaluation, I get this error:

  File "../mlchem/potentials.py", line 191, in calculate
    strict=True)
  File "/home/muammar/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for NeuralNetwork:
	Unexpected key(s) in state_dict: "intercept_Cu", "slope_Cu". 

How can I load a model like this? I need those two variables to be recognized by .load_state_dict(). I would appreciate any suggestions.

P.S. I checked other posts, but I am not getting the same error they reported.
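
For reference, here is a minimal sketch that reproduces this kind of mismatch (the class and sizes are made up for illustration, not the actual mlchem model):

import torch
import torch.nn as nn

class NeuralNetwork(nn.Module):
    def __init__(self, training=True):
        super().__init__()
        self.linear = nn.Linear(8, 1)
        if training:
            # Extra per-element parameters registered only while training.
            self.slope_Cu = nn.Parameter(torch.ones(1))
            self.intercept_Cu = nn.Parameter(torch.zeros(1))

# Save from a training instance...
torch.save(NeuralNetwork(training=True).state_dict(), 'model.pt')

# ...then load into an inference instance that lacks the extra attributes.
model = NeuralNetwork(training=False)
model.load_state_dict(torch.load('model.pt'), strict=True)
# RuntimeError: Unexpected key(s) in state_dict: "intercept_Cu", "slope_Cu".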

Could you post the model definition or parts of it with the saving and loading code?

  1. Saving state_dict().
  2. Loading the model.
  3. The model is defined by model.prepare_model().

This can be reproduced with the examples at https://github.com/muammar/mlchem/tree/master/examples.
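
For anyone debugging something similar, a small helper to print exactly which keys disagree between a checkpoint file and a fresh model instance (diff_state_dict is just an illustrative name, not part of mlchem or PyTorch):

import torch
import torch.nn as nn

def diff_state_dict(model: nn.Module, path: str) -> None:
    # Compare the keys stored in a checkpoint against the keys the model expects.
    saved_keys = set(torch.load(path).keys())
    model_keys = set(model.state_dict().keys())
    print('Unexpected (in file, not in model):', sorted(saved_keys - model_keys))
    print('Missing (in model, not in file):', sorted(model_keys - saved_keys))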

I assume the problem at inference time is that I instantiate the neural network class without those unexpected keys. If I remove the conditional statement here, I don’t get the error, but the inferences are wrong because the custom parameters are not the ones I optimized.

Could you try to register these parameters as None in the non-training case, so that they will always be inside the model, and try to load it again?


@ptrblck I tried with None, but that did not work. However, if I set those custom variables to zero, the model loads with strict=True. Commit 233aaa6 is the fix.
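
In case it helps someone else: as far as I can tell, a parameter registered as None is kept as an attribute but skipped when state_dict() is built, so the saved and expected keys still don’t line up. Registering the variables with zero tensors keeps the keys present in both the training and inference instances. A sketch of the idea (sizes are made up):

import torch
import torch.nn as nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 1)
        # Always register the custom variables, initialized to zero, so they
        # show up in state_dict() for both training and inference instances.
        self.slope_Cu = nn.Parameter(torch.zeros(1))
        self.intercept_Cu = nn.Parameter(torch.zeros(1))
        # This would NOT help: a None parameter is kept as an attribute but
        # excluded from state_dict(), so the keys would still be "unexpected".
        # self.register_parameter('slope_Cu', None)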

Now there is something weird: when I use the loaded state dictionary to predict the points in the training set as a sanity check, the predictions don’t match. I will investigate more.

To elaborate more: during the training phase I get these outputs and their respective targets for this regression problem:

tensor([-14.5732574463, -14.5583467484, -14.5344114304, -14.5010814667,
        -14.4517145157, -14.4076881409, -14.3827638626, -14.3705654144,
        -14.3653717041, -14.3649988174], grad_fn=<StackBackward>)
targets
tensor([-14.5868730545, -14.5640010834, -14.5292606354, -14.4874134064,
        -14.4443883896, -14.4063940048, -14.3788156509, -14.3651800156,
        -14.3665180206, -14.3813629150])

After loading the model from the state_dict this is what I get:

mlchem predicted energy = -14.418486595153809
mlchem predicted energy = -14.418549537658691
mlchem predicted energy = -14.418668746948242
mlchem predicted energy = -14.418920516967773
mlchem predicted energy = -14.419416427612305
mlchem predicted energy = -14.419506072998047
mlchem predicted energy = -14.419241905212402
mlchem predicted energy = -14.418638229370117
mlchem predicted energy = -14.41697883605957
mlchem predicted energy = -14.413832664489746

The parameters of a model should be consistent, no matter which mode it is in (train or eval). So your patch commit is the right way to go.

Also, the parameters you used for training and for inference have different values, so the outputs will differ. I do not know what you are actually doing here, but it seems reasonable to me that the training and inference outputs don’t match.

And, it’s odd to loop over the parameters in forward; you can refactor your code like this:

slope = getattr(self, slope_name)
intercept = getattr(self, intercept_name)

The parameters of a model should be consistent, no matter which mode it is in (train or eval). So your patch commit is the right way to go.

Yes, that is correct. I thought only the hidden layers were needed, but actually one needs to recreate the whole model.

Also, the parameters you used for training and for inference have different values, so the outputs will differ. I do not know what you are actually doing here, but it seems reasonable to me that the training and inference outputs don’t match.

Not at all in this case. Let me elaborate: the outputs shown above come from predicting over the training set using the state_dict of the epoch that fulfilled the training criterion. That is the state_dict I saved, so the predictions have to match. Now, I found that the error was that I did not scale the features with the same scaler used for training.
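
In case it is useful, the general pattern is to fit the scaler once on the training features, persist it next to the model, and reuse the same fitted scaler at inference. A generic sketch with sklearn’s StandardScaler (mlchem handles its own scaling; X_train and X_test are placeholders):

import joblib
from sklearn.preprocessing import StandardScaler

# Training: fit on the training features and persist the fitted scaler.
scaler = StandardScaler().fit(X_train)
joblib.dump(scaler, 'scaler.pkl')
train_inputs = scaler.transform(X_train)

# Inference: load the SAME fitted scaler; never fit a new one on new data.
scaler = joblib.load('scaler.pkl')
test_inputs = scaler.transform(X_test)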

And, it’s odd to loop over the parameters in forward; you can refactor your code like this:

slope = getattr(self, slope_name)
intercept = getattr(self, intercept_name)

Thanks, @chenglu :) I will do that.

By the way, I corrected the getattr snippets above (the module comes first, then the attribute name):

slope = getattr(self, slope_name)
intercept = getattr(self, intercept_name)
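
For context, here is how that slots into a per-element forward without looping over named_parameters() (a sketch with hypothetical names and sizes, not the actual mlchem model):

import torch
import torch.nn as nn

class NeuralNetwork(nn.Module):
    def __init__(self, elements=('Cu',)):
        super().__init__()
        self.hidden = nn.Linear(8, 1)
        for symbol in elements:
            # Assigning an nn.Parameter to an attribute registers it on the module.
            setattr(self, 'slope_' + symbol, nn.Parameter(torch.zeros(1)))
            setattr(self, 'intercept_' + symbol, nn.Parameter(torch.zeros(1)))

    def forward(self, x, symbol='Cu'):
        # Fetch the per-element parameters directly by name.
        slope = getattr(self, 'slope_' + symbol)
        intercept = getattr(self, 'intercept_' + symbol)
        return slope * self.hidden(x) + intercept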