I train my models on a single machine with 4 GPUs, and then I save the model's parameters with torch.save(aemodel.state_dict(), file).
When I load the parameters back with aemodel.load_state_dict(torch.load(output_path + 'ae.pkl', map_location='cpu')),
the following error occurs:
Traceback (most recent call last):
File "Run.py", line 714, in <module>
ready_train(args)
File "Run.py", line 384, in ready_train
aemodel.load_state_dict(torch.load(output_path + 'ae.pkl', map_location='cpu'))
File "/home_ex/tianhongtao/SW/anaconda3/envs/test/lib/python3.7/site-packages/torch/serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/home_ex/tianhongtao/SW/anaconda3/envs/test/lib/python3.7/site-packages/torch/serialization.py", line 780, in _legacy_load
deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 262144 got 1536
Could anyone tell me what happened? And how should I save and load a model with distributed training? Thank you very much.
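For reference, here is a minimal, self-contained sketch of the save/load round trip I described, using a tiny stand-in model (my real autoencoder is larger, and output_path points at my checkpoint directory; both are placeholders here). On a single process this round trip works for me; the error only appears after multi-GPU training.

```python
import torch
import torch.nn as nn

# Stand-in autoencoder (placeholder for my real model).
aemodel = nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 8))

output_path = './'  # placeholder for my real checkpoint directory

# Save only the parameters, then reload them onto the CPU --
# the same calls that fail for me after training on 4 GPUs.
torch.save(aemodel.state_dict(), output_path + 'ae.pkl')
state = torch.load(output_path + 'ae.pkl', map_location='cpu')
aemodel.load_state_dict(state)
```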