Loaded pytorch model gives different results than originally trained model

I wrote Python code that varies the hyperparameters of an LSTM model and saves the trained model that tests best. When I load this model and test it on the same data with the same parameters, it predicts differently. If I then retest the loaded version, the results are consistent from that point on. I have tried all the torch.save/load variations that I can find. Any suggestions or pointers to articles on this would be appreciated.
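One of the save/load variations I have tried is the plain state_dict approach, roughly like this (a simplified sketch; the LSTMModel class, the layer sizes, and the file name are placeholders, not my actual code):

import torch
import torch.nn as nn

# stand-in for my actual model; the layer sizes are placeholders
class LSTMModel(nn.Module):
    def __init__(self, n_features=8, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])

model = LSTMModel()
# ... hyperparameter search / training happens here ...
torch.save(model.state_dict(), "best_model.pt")  # save the best-testing model

# later: rebuild the architecture and load the saved weights
loaded = LSTMModel()
loaded.load_state_dict(torch.load("best_model.pt"))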

Does this mean you are just repeating exactly the same workflow and the second time the predictions change and match your expectation?


Yes, I train and test (test 1) a time series model, save it, load it (specifically to check if the predictions on the test set are repeatable), and then test (test 2) the loaded model on the same set of test data. The predictions are different. If I then load this model and test again (test 3), the predictions are consistent with test 2. I see this behavior with Linear AND LSTM models. I have tried all the save/load combinations I have read about.

Thanks for the explanation.
So it seems test 2 and test 3 (both load the model and compare predictions on the same test set) match. However, the predictions created right after the training (test 1) do not match test 2 or 3 (after loading the model again).
We've often seen such behavior come down to a user error, e.g. a change in the data processing.
To debug the issue I would recommend comparing the model outputs in eval() mode right after training (test 1), using static input data (e.g. torch.ones or a serialized random tensor), against the predictions created after loading the model (tests 2 and 3).
If these differ, the model itself might not have been fully restored for some reason.
However, if the predictions using a static input match, you could check the data loading and processing next.
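Something along these lines could work as a rough sketch (compare_outputs, the tolerance, and the example input shape are just placeholders and would need to match your model; if test 1 and tests 2/3 run in separate scripts, you could torch.save the reference output in the first script and torch.load it in the second):

import torch

def compare_outputs(trained_model, loaded_model, example_input, atol=1e-6):
    # run both models on the same static input in eval() mode
    trained_model.eval()
    loaded_model.eval()
    with torch.no_grad():
        out_before = trained_model(example_input)
        out_after = loaded_model(example_input)
    print("max abs diff:", (out_before - out_after).abs().max().item())
    return torch.allclose(out_before, out_after, atol=atol)

# e.g. using a static input with your model's expected shape
# fixed_input = torch.ones(1, 20, 8)  # placeholder shape: (batch, seq_len, features)
# print(compare_outputs(model, loaded_model, fixed_input))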

I take it from your suggestions that one SHOULD expect a loaded model to behave the same as the model behaved before it was saved. Being new to this, I did not even know whether that expectation was realistic. I will try simpler test data and see if the behavior repeats. You mentioned "we've often seen such behavior come down to a user error, e.g. a change in the data processing." Please point me to any other discussions of this that come to mind. Thanks for the help!

Yes, assuming the model does not apply any randomness (e.g. model.eval() will disable dropout layers) and the same data is used. In this case you would expect to see the same output, up to the expected mismatches caused by the limited floating point precision (you could also enable deterministic algorithms, at the cost of some performance).
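As a small illustration of the dropout point (a toy model, not your actual setup):

import torch
import torch.nn as nn

torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

toy.train()
print(toy(x))  # dropout active: repeated calls give different outputs
print(toy(x))

toy.eval()
print(toy(x))  # dropout disabled: repeated calls match exactly
print(toy(x))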

A quick search pointed to e.g. this issue which seems to be caused by a difference in the validation set.

Maybe there is a random seed or generator somewhere in your snippets that re-initializes your loaded state dict!?

Not that I know of. But PyTorch is a black box (to me) and that may be happening. That the loaded model almost always performs worse may itself be a hint that I'm just cherry-picking the best parameters during training.

Okay, try explicitly setting requires_grad = True for all the tensors and then see if anything changes. PyTorch doesn't track all the grads since requires_grad is False by default. Hence it only keeps selective .grad values in the model's state dict, and the leaf nodes' gradients are randomly initialized when loading the state. Maybe that's the root cause. Just a hunch!
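Something like this sketch is what I mean (enable_all_grads is just a hypothetical helper):

def enable_all_grads(model):
    # hypothetical helper: make sure every parameter tracks gradients
    for name, param in model.named_parameters():
        param.requires_grad_(True)
        print(name, param.requires_grad)
    return model

# e.g. call enable_all_grads(model) right after loading the state dict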

Continuing this reproducibility issue, I decided to save the trained PyTorch LSTM model and then reload it for testing within the same Python script (script 1), rather than save it, exit the script, and then load it for testing in script 2. I select the best tested model in script 1, save that model, and load it from script 2. The point was to see if the save/load sequence was introducing the randomness, but that is not the case. The test results are different between scripts 1 and 2, but are consistent across subsequent runs of script 2. I have added the code below to both scripts:

import random
import numpy as np
import torch

seed = 42  # seed is any fixed integer, set once in each script
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
torch.use_deterministic_algorithms(True)

I have also set all tensors to requires_grad=True per another responder.

Any ideas/suggestions? Thanks.

Could you post a minimal and executable code snippet reproducing this behavior please?

Thanks for the response. I "fixed" it by setting a random seed before just about every piece of code that could possibly introduce some randomness. Now I'll just remove each seed call one by one to see which call to NumPy or PyTorch was involved.