Pickle data corruption with big tensors

Giorgio · November 17, 2021, 8:20am

Hello, I’ve experienced serious data corruption while saving tensors to a pickle file (about 1 GB of data). Maybe I tried to write to it again before the previous write had completed? It’s a Google Colab script that saves data to its Google Drive environment, and each write came a minute or so after the previous one. No multithreading is involved, just a save operation in a while loop. How can I make sure that this doesn’t happen again? Is there a way to check whether there’s a file lock, or whether a process is still writing to the file?

Thanks

It saves like this:

torch.save(colonne_out, '0.pt')

After saving the file many times, I tried to load it, but it gave an error:

colonne_out = torch.load(pickle_file, map_location=torch.device(device))

RuntimeError                              Traceback (most recent call last)
in ()
      1 # if I'm offline, load the output columns file
----> 2 colonne_out = torch.load(pickle_file, map_location=torch.device(device))

1 frames
/usr/local/lib/python3.7/dist-packages/torch/serialization.py in __init__(self, name_or_buffer)
    240 class _open_zipfile_reader(_opener):
    241     def __init__(self, name_or_buffer) -> None:
--> 242         super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
    243
    244

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

ptrblck · November 17, 2021, 8:42am

I cannot reproduce the issue using this small example:

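import torch

# float32 tensor with 1024**3 elements, about 4 GB
x = torch.randn(1024**3)

# save and reload the same file repeatedly
for _ in range(10):
    torch.save(x, "tmp.pt")
    y = torch.load("tmp.pt")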

Could you check whether you are hitting the same error using my code?

Giorgio · November 18, 2021, 9:29am

Thanks for your feedback. I let it run long enough on Google Colab and got no error. Since the corruption happened to me at least once a week, I wonder where the error came from. Anyway, how does the corruption happen? Is there a way to check that the data has been written, or is the only solution to try loading the file?

ptrblck · November 18, 2021, 10:26am

File corruption is often caused by writes to the same file from multiple processes. Since you have apparently excluded this cause, I don’t know what might be triggering the error, as I haven’t seen it before.
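
For what it's worth, the "failed finding central directory" message is what you would expect from a truncated file: the zip central directory is written at the very end of the archive, so an interrupted write leaves it missing. A minimal sketch of one possible workaround follows, assuming the default zip-based `torch.save` format. The helper names `save_atomically` and `looks_like_valid_zip` are made up for illustration and are not part of PyTorch. The idea is to write to a temporary file, flush it to disk, and only then rename it over the real file, so an interrupted save leaves the previous file in place. The check uses Python's standard `zipfile` module and does not load the tensors.

```python
import os
import zipfile


def save_atomically(write_fn, path):
    """Write via write_fn to a temporary file, then rename it over `path`.

    write_fn is a hypothetical callback that receives an open binary file,
    e.g. lambda f: torch.save(obj, f).
    """
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "wb") as f:
        write_fn(f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to storage
    # Replace the old file only after the new one is fully written.
    os.replace(tmp_path, path)


def looks_like_valid_zip(path):
    """Return True if `path` has a zip central directory and passes CRC checks."""
    if not zipfile.is_zipfile(path):
        return False  # truncated files typically fail here
    with zipfile.ZipFile(path) as zf:
        # testzip() reads every member and returns the first bad name, or None.
        return zf.testzip() is None
```

With the code from this thread, usage could look like `save_atomically(lambda f: torch.save(colonne_out, f), '0.pt')` followed by `looks_like_valid_zip('0.pt')`. Note that `testzip()` reads the whole file, so on a 1 GB checkpoint it is slower than `is_zipfile()` alone, and whether the rename is truly atomic on a mounted Google Drive folder is an assumption I have not verified.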