Getting RuntimeError While Loading Model Weights

I trained an LLM on an EC2 instance and saved the model weights using torch.save().

When I try to load the weights locally, I get this error:

  File "/Users/harshmittal/Documents/GitHub/LLM Hustle/241M_Model_1.py", line 46, in <module>
    state_dict = torch.load("models/241M_Mode_1.pth")
  File "/Users/harshmittal/Library/Python/3.9/lib/python/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/Users/harshmittal/Library/Python/3.9/lib/python/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Everything works perfectly fine when I train, save the weights, and reload them locally. I’m not sure why the weight file from EC2 ends up corrupted.

Looking forward to a solution from the community :slight_smile:
Cheers!

Are you able to open the checkpoint manually? It should be a zip archive. If you can’t open it, the file is likely corrupt.
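For anyone unsure how to check this by hand, here is a minimal sketch using only Python's standard `zipfile` module. It relies on the traceback above, which shows the loader opening the file as a zip archive. The function name `check_checkpoint_archive` is hypothetical, and the path in the usage comment is simply the one from the traceback.

```python
import zipfile


def check_checkpoint_archive(path):
    """Return (ok, message) describing whether `path` is a readable zip archive.

    Hypothetical helper for illustration; it is not part of PyTorch.
    """
    # is_zipfile() looks for the end-of-central-directory record, the same
    # structure the "failed finding central directory" error complains about.
    if not zipfile.is_zipfile(path):
        return False, "not a valid zip archive (missing or truncated file?)"

    with zipfile.ZipFile(path) as archive:
        # testzip() reads every member and returns the first one whose
        # CRC check fails, or None if all members are intact.
        bad_member = archive.testzip()
        if bad_member is not None:
            return False, f"corrupt member: {bad_member}"
        return True, f"archive looks intact ({len(archive.namelist())} members)"


# Usage (path taken from the traceback above):
#   ok, message = check_checkpoint_archive("models/241M_Mode_1.pth")
#   print(ok, message)
```

If this reports that the file is not a valid zip archive on the local machine but passes on the EC2 instance, the file was most likely damaged or truncated in transit.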

Thanks for the reply, @ptrblck.

I’m new to this and not sure how to open the checkpoint manually.

I saved the model weights using “torch.save(model_instance.state_dict(), file_path)” on an EC2 instance.

Then I simply downloaded the .pth file to my system.

Can you suggest a robust practice for saving model weights? I’d appreciate it :slightly_smiling_face:

Cheers!

So when you torch.load() the file directly on the EC2 machine, it works fine? In that case this almost certainly sounds like data corruption during the local download. How are you doing the local download?
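One way to confirm corruption in transit is to compare a checksum of the file on both machines. The sketch below uses only the standard `hashlib` module. The function name `sha256_of_file` and the chunk size are illustrative choices, not anything PyTorch provides.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of the file at `path`.

    Hypothetical helper for illustration. The file is read in chunks so
    that large checkpoints do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage: run this on the EC2 instance and on the local copy, then compare.
#   print(sha256_of_file("models/241M_Mode_1.pth"))
```

If the two digests differ, or the file sizes differ, the local copy is not the same file that was saved on EC2.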

Hi @marksaroufim,

<<< So when you torch.load() the file on the ec2 machine directly it works fine? >>>
Yes! Everything worked fine.

<<<How are you doing the local download?>>>
I connected to the EC2 instance via VS Code SSH, selected the file, and used the download option within VS Code.

I’d appreciate your guidance on a robust approach to saving and downloading model weights.

Cheers!

Yeah, VS Code’s download-from-file option can be very flaky, and I’ve run into this issue many times.

So I’d suggest either using scp, which has a format like this:

scp -i /path/to/privatekey.pem ec2-user@ec2-instance-public-dns:path/to/remote/file /local/destination

Or you can copy your artifacts to an S3 bucket and then download them from there.

Oh wow! scp sounds like a robust solution to me.

Thanks @marksaroufim for sharing it. It’ll save me from a lot of trouble in the future :slight_smile:

By the way, I really admire the engineering side of PyTorch and what an engineering marvel it is. It does so many things under the hood, and as a fresher I’m now starting to understand the real meaning of a ‘framework’.

Cheers!
