How to combine multiple torch.load

Ripper346 · March 15, 2021, 4:24pm

Hi, I am trying to make a custom Dataset with PyTorch Geometric. I need to replace the last line of the class’s __init__ in "Creating Your Own Datasets" (pytorch_geometric 1.6.3 documentation) so that it handles multiple loads, because I have several files to load into the same dataset.
From what I understand (I can’t find it in the docs), torch.load(file_path) returns a tuple of data and slices here. How can I combine these across multiple files while keeping the object structure? It is a little hard to explain; I hope you understand my problem. Thank you.

patrickwilliams3 · March 15, 2021, 4:51pm

It looks like you can pass an io.BytesIO() into torch.load. I would try using io.BytesIO to read from the multiple files and then pass that single buffer to torch.load().

Ripper346 · March 15, 2021, 5:10pm

OK, but that only moves the problem: how can I concatenate those BytesIO buffers? I tried this:

buffer = io.BytesIO()
for file in self.processed_paths:
    with open(file, 'rb') as f:
        buffer.write(f.read())
self.data, self.slices = torch.load(buffer)

but I get an EOFError when loading.

patrickwilliams3 · March 15, 2021, 5:49pm

Try doing buffer.seek(0) before torch.load()
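
A minimal sketch of why the rewind matters, using only io.BytesIO and a placeholder payload (an assumption for illustration, not a real .pt file). After a write, the buffer position sits at the end, so a reader that starts from the current position finds nothing, which is consistent with the EOFError above:

```python
import io

buffer = io.BytesIO()
buffer.write(b"payload")      # placeholder bytes; writing moves the position to the end

at_end = buffer.read()        # reading from the end yields no data
buffer.seek(0)                # rewind to the start of the buffer
from_start = buffer.read()    # now the full contents are readable
```

This only addresses the read position; it does not say anything about whether the bytes in the buffer form something torch.load can parse.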

Ripper346 · March 15, 2021, 6:21pm

With seek, the error changed. Now I have:

RuntimeError: [enforce fail at ..\caffe2\serialize\inline_container.cc:145] . PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted

This error kind of makes sense: I only concatenated some buffers. I don’t think that is a well-formed structure for torch to load, is it? The files work independently; this isn’t one file split into multiple files for some reason.
Let’s say I have a.pt and b.pt. a.pt contains one dataset and b.pt contains another, and I have to merge them together. If I run torch.load('a.pt') it works; same for b.pt.
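
To make the a.pt / b.pt situation concrete, here is a minimal sketch under loud assumptions: plain tensors stand in for the real dataset contents, the files are written to a temporary directory, and load_each is a hypothetical helper name. It shows that each file loads on its own with a separate torch.load call, so one possible direction is to combine the loaded objects in Python rather than the raw bytes:

```python
import os
import tempfile

import torch


def load_each(paths):
    """Hypothetical helper: call torch.load once per file, keeping file order."""
    return [torch.load(path) for path in paths]


# Stand-in files: plain tensors instead of the real (data, slices) tuples.
tmp_dir = tempfile.mkdtemp()
path_a = os.path.join(tmp_dir, "a.pt")
path_b = os.path.join(tmp_dir, "b.pt")
torch.save(torch.arange(3), path_a)
torch.save(torch.arange(3, 5), path_b)

parts = load_each([path_a, path_b])   # one loaded object per file
merged = torch.cat(parts)             # merging happens on the loaded objects
```

For the real files, the merge step would have to respect the structure of the data and slices objects, which torch.cat on its own does not do; that part is left open here.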