Getting RuntimeError While Loading Model Weights

I trained an LLM on an EC2 instance & saved the model weights using

When I try to load the weights locally, I get this error:

  File "/Users/harshmittal/Documents/GitHub/LLM Hustle/", line 46, in <module>
    state_dict = torch.load("models/241M_Mode_1.pth")
  File "/Users/harshmittal/Library/Python/3.9/lib/python/site-packages/torch/", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/Users/harshmittal/Library/Python/3.9/lib/python/site-packages/torch/", line 457, in __init__
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Everything works perfectly fine when I train, save the weights & re-load them locally. I’m not sure why the weight file from EC2 ends up corrupted.

Looking forward to a solution from the community.

Are you able to open the checkpoint manually? It is a zip archive, so a standard archive tool should be able to read it. If not, the file might indeed be corrupt.
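
For anyone unsure how to open the checkpoint manually, here is a minimal sketch using Python's standard `zipfile` module. It assumes the checkpoint was written in PyTorch's default zip-based format; the helper name `inspect_checkpoint` is made up for illustration and is not part of PyTorch.

```python
import zipfile


def inspect_checkpoint(path):
    """Return the archive member names if `path` is a readable zip, else None.

    Hypothetical helper for illustration only; not a PyTorch API.
    """
    # is_zipfile() looks for the zip end-of-central-directory record,
    # which is the part that is missing when a file is truncated.
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first member that fails its
        # CRC check, or None when every member reads back cleanly.
        if archive.testzip() is not None:
            return None
        return archive.namelist()


if __name__ == "__main__":
    names = inspect_checkpoint("models/241M_Mode_1.pth")
    if names is None:
        print("Not a readable zip archive; the file is likely corrupt or truncated.")
    else:
        print("Archive looks intact. Members:", names)
```

If this reports an unreadable archive locally but an intact one on the EC2 machine, the file was most likely damaged in transit rather than at save time.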

Thanks for the reply @ptrblck.

I’m new to this & not sure how to open the checkpoint manually.

I saved the model weights using “, file_path)” on an EC2 instance.

Then I simply downloaded the .pth file to my system.

Can you suggest a robust practice for saving model weights? I’d appreciate it.


So when you torch.load() the file on the EC2 machine directly, it works fine? In that case, this almost certainly sounds like data corruption during the local download. How are you doing the local download?

Hi @marksaroufim,

<<< So when you torch.load() the file on the ec2 machine directly it works fine? >>>
Yes! everything worked fine.

<<<How are you doing the local download?>>>
I connected to the EC2 instance via VS Code SSH. I selected the file & saw an option to download it (within VS Code).

I would appreciate your guidance on a robust approach to saving & downloading model weights.


Yeah, VS Code’s file download option can be very flaky, and I’ve run into this issue many times.

So I’d suggest either using scp, which has a format like this:

scp -i /path/to/privatekey.pem ec2-user@ec2-instance-public-dns:path/to/remote/file /local/destination

Or you can copy your artifacts to an S3 bucket and then download them from there.
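
Whichever transfer method is used, one way to confirm the download is intact is to compare a checksum of the file on both machines. The following is only a sketch using the standard `hashlib` module; the function name `sha256_of_file` and the command-line usage are illustrative assumptions, not something from PyTorch or AWS.

```python
import hashlib
import sys


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`.

    Hypothetical helper for illustration. The file is read in chunks so
    that a large checkpoint does not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Run with the checkpoint path on the EC2 instance and again locally,
    # then compare the two printed digests by eye.
    if len(sys.argv) == 2:
        print(sha256_of_file(sys.argv[1]))
    else:
        print("usage: python checksum.py <path-to-checkpoint>")
```

If the two digests differ, the local copy is not byte-identical to the remote one and should be transferred again before calling torch.load() on it.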

Oh wow! scp sounds like a robust solution to me.

Thanks @marksaroufim for sharing it. It’ll save me from a lot of trouble in the future.

Btw - I really admire the engineering side of PyTorch & what an engineering marvel it is. It does so many things under the hood & as a fresher, I’m now starting to get the real meaning of a ‘Framework’.


