I was reading the Save and Load Model tutorial, but it wasn’t clear why I’d use torch.save over pickle.dump.
What worries me is that my neural net modules/objects contain many more things besides just parameters. Since PyTorch saves things using the state_dict, I worried that something might be missing (https://pytorch.org/tutorials/beginner/saving_loading_models.html).
Isn’t it easier, for my use case, to simply use pickle.dump? Why would I use torch.save in general?
The state_dict will store all registered parameters and buffers.
If you need to serialize additional tensors, you should thus create an nn.Parameter if the tensor is trainable, or register a buffer via self.register_buffer(name, tensor) if it’s not.
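As a quick sanity check (a minimal sketch; the module and attribute names are made up), only registered parameters and buffers show up in the state_dict, while plain tensor attributes are silently skipped:

import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # registered parameter: trainable and included in the state_dict
        self.weight = nn.Parameter(torch.randn(4, 4))
        # registered buffer: not trainable, but still included in the state_dict
        self.register_buffer("running_stat", torch.zeros(4))
        # plain attribute: NOT included in the state_dict
        self.some_tensor = torch.ones(4)

m = MyModule()
print(m.state_dict().keys())
# odict_keys(['weight', 'running_stat'])  <- 'some_tensor' is missing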
That’s generally not recommended.
Instead, you should store the state_dicts and the source files separately.
Storing the complete model, for example, could force you to recreate exactly the same file and folder structure.
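In code, the recommended workflow could look roughly like this (the file name and model class are placeholders):

import torch
from models.ExampleNet import ExampleNet  # model definition lives in your source files

# saving: persist only the state_dict, not the module object
model = ExampleNet()
torch.save(model.state_dict(), "example_net.pt")

# loading: rebuild the model from the current source, then restore the tensors
model = ExampleNet()
model.load_state_dict(torch.load("example_net.pt"))
model.eval()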
Hi @ptrblck, thanks for your responses! Greatly appreciated.
I am curious: why is it not recommended to store the whole Python program state (or as close to that as possible)?
I am currently using dill to simulate this as much as possible because my neural network module classes contain pointers to many things, including lambda functions I’d like to restore properly.
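To illustrate what I mean (a rough sketch; the class and attribute are just examples), the standard pickle module refuses to serialize a module holding a lambda, while dill handles it:

import pickle
import dill
import torch
import torch.nn as nn

class LambdaNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.post = lambda x: x * 2  # plain pickle cannot serialize this

net = LambdaNet()

try:
    pickle.dumps(net)
except Exception as e:
    print("pickle failed:", e)  # can't pickle the local lambda

# dill serializes the lambda by value, so the round trip works
restored = dill.loads(dill.dumps(net))
print(restored.post(torch.ones(1)))  # tensor([2.])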
As mentioned in my last post, you could be forced to recreate the exact file and folder structure if you save the complete model via torch.save.
I don’t know how dill handles this case.
Yes, the same file structure needs to be maintained.
Moreover, the function that was active at save time will still be the one used after loading.
I loaded a YOLOv5 model (version 4.0, which uses SiLU as the activation), but accidentally used a local copy of YOLOv5 version 1.0, which uses LeakyReLU. The loaded model still used SiLU, so I guess the function is stored in the serialized checkpoint.
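My understanding (just a sketch, not verified against the YOLOv5 source) is that the activation is a child module instance stored inside the pickled object, and unpickling restores the saved attributes without re-running __init__, so the old activation survives:

import torch
import torch.nn as nn

class Conv(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)
        self.act = nn.SiLU()  # activation at save time

model = Conv()
torch.save(model, "conv_full.pt")  # pickles the whole object, including self.act

# Even if the local Conv class were later edited to use nn.LeakyReLU(),
# loading restores the saved attributes, so self.act is still SiLU.
# (On recent PyTorch releases, pass weights_only=False to unpickle full objects.)
loaded = torch.load("conv_full.pt")
print(type(loaded.act))  # <class 'torch.nn.modules.activation.SiLU'>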
@ptrblck I have a follow-up question to this. In production, we have frequently run into incompatibility issues caused by saving/loading across changed source files. For example, let’s say we trained and deployed a model in January which would be loaded like so:
import torch
from models.ExampleNet import ExampleNet

# rebuild the architecture from source, then restore the trained tensors
model = ExampleNet()
loaded = torch.load("example_saved_file.pt")
loaded_state_dict = loaded["state_dict"]
model.load_state_dict(loaded_state_dict)
This seems fine. However, in February some changes to the model definition were made as part of the R&D process; perhaps additional norm layers were added, or the model was made deeper. These changes were subsequently pushed to dev, staging, and prod, and deployed to the live system. The above loading procedure now fails since the model architecture doesn’t match that of the loaded_state_dict. Though saving the model directly is not generally recommended, would doing so here with something like torch.save(trained_model, path) (instead of saving its state_dict) be a good approach, or at least a workaround? I’m curious to know if there are additional factors that would still make it a bad idea.
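To make the failure concrete, here is a rough sketch (the layer names are invented) of the kind of error we hit:

import torch
import torch.nn as nn

# January definition, which produced the checkpoint
class ExampleNetV1(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)

# February definition, with an extra norm layer added during R&D
class ExampleNetV2(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm = nn.BatchNorm1d(16)
        self.fc = nn.Linear(16, 4)

state_dict = ExampleNetV1().state_dict()
model = ExampleNetV2()
model.load_state_dict(state_dict)
# RuntimeError: Error(s) in loading state_dict for ExampleNetV2:
#   Missing key(s) in state_dict: "norm.weight", "norm.bias",
#   "norm.running_mean", "norm.running_var", ...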
I don’t believe so, as loading the full model could easily fail after source code changes, and the error messages might be quite confusing if the model definition changed.
Especially if your model uses and depends on a lot of imports from other scripts, which could themselves change, debugging could become quite challenging.
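If you want the mismatch to fail with a clearer message, one option (just a sketch, reusing model and loaded_state_dict from your snippet above) is to load with strict=False and inspect the reported keys yourself:

# strict=False returns the mismatches instead of raising immediately
result = model.load_state_dict(loaded_state_dict, strict=False)
if result.missing_keys or result.unexpected_keys:
    raise RuntimeError(
        f"checkpoint does not match the model definition: "
        f"missing={result.missing_keys}, unexpected={result.unexpected_keys}"
    )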