RuntimeError: storage has wrong size

I use a single machine with 4 GPUs to train my models, and I save the model's parameters with `torch.save(…, file)`. When I then load them with `aemodel.load_state_dict(torch.load(output_path + 'ae.pkl', map_location='cpu'))`, the following error occurs:

Traceback (most recent call last):
  File "", line 714, in <module>
  File "", line 384, in ready_train
    aemodel.load_state_dict(torch.load(output_path + 'ae.pkl', map_location='cpu'))
  File "/home_ex/tianhongtao/SW/anaconda3/envs/test/lib/python3.7/site-packages/torch/", line 593, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home_ex/tianhongtao/SW/anaconda3/envs/test/lib/python3.7/site-packages/torch/", line 780, in _legacy_load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 262144 got 1536

Could anyone tell me what happened? And how should I save and load a model with distributed training? Thank you very much.

This error might be raised, if you were using multiple processes (e.g. via DistributedDataParallel) and didn’t guard the storing of the checkpoint against multiple writers.
You could use the ImageNet example to only use the first process to store the checkpoint.
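A minimal sketch of that pattern, assuming DDP training with `torch.distributed` (the helper name `save_checkpoint` is illustrative, not from the example):

```python
import torch
import torch.distributed as dist

def save_checkpoint(model, path):
    # Let only rank 0 write the file, so multiple processes don't
    # clobber each other's writes and corrupt the checkpoint.
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        # Unwrap DistributedDataParallel so the state_dict keys are
        # not prefixed with "module." when loading into a plain model.
        state = model.module.state_dict() if hasattr(model, "module") else model.state_dict()
        torch.save(state, path)
    if dist.is_initialized():
        # Make sure the file is fully written before any rank loads it.
        dist.barrier()
```

All ranks can call this function; only the first process actually touches the file, which avoids the truncated/garbled checkpoint that produces the "storage has wrong size" error.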


Thank you very much! I understand now.

This was fixed for me once I uploaded the checkpoint to the cluster again.