Getting RuntimeError While Loading Model Weights

Harsh_Mittal · March 30, 2024, 11:25am

I trained an LLM on an EC2 instance & saved the model weights using torch.save().

When I try to load the weights locally, I get this error:

  File "/Users/harshmittal/Documents/GitHub/LLM Hustle/241M_Model_1.py", line 46, in <module>
    state_dict = torch.load("models/241M_Mode_1.pth")
  File "/Users/harshmittal/Library/Python/3.9/lib/python/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/Users/harshmittal/Library/Python/3.9/lib/python/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Everything works perfectly fine when I train, save the weights, and re-load them locally. I’m not sure why the weight file from EC2 is getting corrupted.

Looking forward to a solution from the community
Cheers!

ptrblck · March 30, 2024, 4:18pm

Are you able to open the checkpoint manually? It is a zip archive, so any zip tool should be able to list its contents. If that fails, the file is probably corrupt.
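
For reference, torch.save() in its default format (PyTorch 1.6 and later) writes a zip archive, so Python's standard zipfile module can run this check without importing PyTorch. A minimal sketch, where `checkpoint_is_readable` is a hypothetical helper name and not part of PyTorch:

```python
import zipfile


def checkpoint_is_readable(path):
    """Return True if `path` is a zip archive whose members all pass their CRC check."""
    # is_zipfile() looks for the end-of-central-directory record; a truncated
    # file typically lacks it, which matches the "failed finding central
    # directory" message in the traceback.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        return archive.testzip() is None


if __name__ == "__main__":
    # Path taken from the traceback above; adjust it to your own file.
    print(checkpoint_is_readable("models/241M_Mode_1.pth"))
```

If this prints False for the downloaded copy but True for the copy on the machine that wrote it, the file was most likely damaged after saving rather than by torch.save() itself.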

Harsh_Mittal · April 2, 2024, 4:36am

Thanks for the reply @ptrblck.

I’m new to this & not sure about opening the checkpoint manually.

I saved the model weights using “torch.save(model_instance.state_dict(), file_path)” on an EC2 instance.

Then I simply downloaded the .pth file to my system.

Can you suggest a robust practice for saving model weights? I’d appreciate it.

Cheers!

marksaroufim · April 2, 2024, 5:40am

So when you torch.load() the file on the ec2 machine directly it works fine? In that case this almost certainly sounds like data corruption during the local download. How are you doing the local download?
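
One way to confirm this is to hash the file on both machines and compare the digests; a mismatch means the bytes changed in transit. A minimal sketch using only the standard library (`sha256_of_file` is a hypothetical helper name):

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run it on the same checkpoint on the EC2 instance and on the local machine; the two hex strings should be identical. Comparing file sizes first is a cheaper check that already catches a truncated download.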

Harsh_Mittal · April 2, 2024, 5:56am

Hi @marksaroufim,

<<< So when you torch.load() the file on the ec2 machine directly it works fine? >>>
Yes! Everything worked fine.

<<<How are you doing the local download?>>>
I connected to the EC2 instance via VS Code SSH. I selected the file & saw an option to download it (within VS Code).

I’d appreciate your guidance on a robust approach to saving & downloading model weights.

Cheers!

marksaroufim · April 2, 2024, 3:31pm

Yeah, VS Code’s file download option can be very flaky, and I’ve run into this issue many times.

So I’d suggest either using scp, which has a format like this:

scp -i /path/to/privatekey.pem ec2-user@ec2-instance-public-dns:path/to/remote/file /local/destination

Or you can copy your artifacts to an S3 bucket and then download them from there.

Harsh_Mittal · April 2, 2024, 5:40pm

Oh wow! scp sounds like a robust solution to me.

Thanks @marksaroufim for sharing it. It’ll save me from a lot of trouble in the future.

Btw, I really admire the engineering side of PyTorch & what an engineering marvel it is. It does so many things under the hood, & as a fresher, I’m now starting to get the real meaning of a ‘Framework’.

Cheers!