Hello,
I’m working on creating a Dataset from roughly 100,000 graph samples that I’ve built with PyTorch Geometric.
I roughly followed the guide here:
Creating Your Own Datasets — pytorch_geometric 2.0.4 documentation.
After preprocessing, my data takes up roughly 14 GB in total. I think that is probably too large for an InMemoryDataset, so I opted for a regular Dataset instead.
As a result, my processed folder now contains 100k files, one per sample. In hindsight this was a bad choice: iterating over the whole dataset is extremely costly because I have to open and close 100k files. I’m unsure how to group my data into larger files while still keeping track of which file contains which indices. And even if I did group the samples this way, how would my “get” function know to keep a file open (or its contents loaded) until I’m done with all the samples in it? These open questions are why I never went through with it.
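To make the grouping idea concrete, here is a rough sketch of one way it might work. It assumes fixed-size shards, so that the file and the position inside it follow from the sample index by integer division, and it keeps only the most recently loaded shard in memory. The names `write_shards` and `ShardedStore` are hypothetical, and plain `pickle` stands in for whatever serialization the real samples use; nothing here is PyTorch Geometric API.

```python
import os
import pickle
import tempfile


def write_shards(samples, out_dir, shard_size):
    """Write `samples` into files holding `shard_size` samples each.

    Returns the list of shard paths, in index order.
    """
    paths = []
    for start in range(0, len(samples), shard_size):
        path = os.path.join(out_dir, f"shard_{start // shard_size:05d}.pkl")
        with open(path, "wb") as f:
            pickle.dump(samples[start:start + shard_size], f)
        paths.append(path)
    return paths


class ShardedStore:
    """Index-addressable view over shard files (hypothetical helper).

    Only the most recently used shard is kept in memory, so sequential
    access loads each file once instead of once per sample.
    """

    def __init__(self, paths, shard_size, num_samples):
        self.paths = paths
        self.shard_size = shard_size
        self.num_samples = num_samples
        self._cached_id = None      # which shard is currently in memory
        self._cached_shard = None   # the list of samples from that shard
        self.loads = 0              # number of file reads, for inspection

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if not 0 <= idx < self.num_samples:
            raise IndexError(idx)
        # Fixed-size shards: the shard id and the offset inside the shard
        # follow directly from the global index.
        shard_id, offset = divmod(idx, self.shard_size)
        if shard_id != self._cached_id:
            with open(self.paths[shard_id], "rb") as f:
                self._cached_shard = pickle.load(f)
            self._cached_id = shard_id
            self.loads += 1
        return self._cached_shard[offset]
```

In a `torch_geometric.data.Dataset` subclass, `len()` and `get(idx)` could presumably delegate to something like this, with `torch.save` and `torch.load` in place of `pickle`. Note that the cache only helps when consecutive indices fall in the same shard: fully random shuffling would reload a shard on almost every access, so one option is to shuffle the order of the shards and then shuffle within each shard. With multiple DataLoader workers, each worker would hold its own cached shard.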
Does anyone have any ideas on how I can deal with this?
Thanks