Pickle data corruption with big tensors

Hello, I’ve experienced serious data corruption while saving tensors to a pickle file (about 1 gigabyte of data). Perhaps I tried to write to it again before the previous write was complete? It’s a Google Colab script that saves data to its Google Drive environment, and each write happened a minute or so after the previous one… no multithreading involved, just a save operation inside a while loop. How can I make sure this doesn’t happen again? Is there a way to check whether there’s a file lock, or whether a process is still writing to the file?


It saves like this:

torch.save(colonne_out, '0.pt')

After saving the file many times, I tried to load it back but got this error:

colonne_out = torch.load(pickle_file, map_location=torch.device(device))

RuntimeError                              Traceback (most recent call last)

in <module>()
      1 # if I'm offline, load the output-columns file
----> 2 colonne_out = torch.load(pickle_file, map_location=torch.device(device))

1 frames

/usr/local/lib/python3.7/dist-packages/torch/serialization.py in __init__(self, name_or_buffer)
    240 class _open_zipfile_reader(_opener):
    241     def __init__(self, name_or_buffer) -> None:
--> 242         super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

I cannot reproduce the issue using this small example:

# 4GB
x = torch.randn(1024**3)

for _ in range(10):
    torch.save(x, "tmp.pt")
    y = torch.load("tmp.pt")

Could you check if you are hitting the same error using my code?


Thanks for your feedback. I ran your code long enough on Google Colab and got no error. Since the corruption happened to me at least once a week, I wonder where it came from. Anyway, how does this kind of corruption happen? Is there a way to check whether the data was fully written, or is the only option to try loading the file back to verify it?

File corruption is often caused by multiple processes writing to the same file. Since you have apparently excluded this possibility, I don’t know what might be causing the error, as I haven’t seen it before.
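As a general safeguard against partially written checkpoints (this is a standard filesystem pattern, not something specific to PyTorch), you could write to a temporary file first and rename it over the target only after the write has been flushed to disk. A crash or interrupted write then leaves the previous good file untouched. This is just a sketch; `save_atomic` is a made-up helper name, and note that `os.replace` is only guaranteed atomic on a local POSIX filesystem, so a mounted Google Drive folder may not give you the same guarantee.

```python
import os
import tempfile

import torch


def save_atomic(obj, path):
    """Save obj with torch.save to a temp file in the same directory,
    then atomically rename it over `path`. An interrupted write leaves
    any previously saved file intact."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".pt.tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            torch.save(obj, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes out of the OS cache
        os.replace(tmp_path, path)  # atomic rename on a local filesystem
    except BaseException:
        os.remove(tmp_path)  # clean up the half-written temp file
        raise
```

To answer the verification question: the only reliable check is to load the file back (a truncated zip archive fails exactly with the "failed finding central directory" error you saw), so wrapping a `torch.load` in a try/except right after saving is a reasonable sanity check.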
