Processing large dataset

I’m working on creating a Dataset from roughly 100,000 graph samples I’ve generated using PyTorch Geometric’s libraries.
I roughly followed the guide here:
Creating Your Own Datasets — pytorch_geometric 2.0.4 documentation.
My data in total, after preprocessing, weighs roughly 14 GB, so I think this is probably too large for an InMemoryDataset, and I opted to implement it as a regular Dataset instead.
So my processed folder now contains 100k files, one sample per file. This was obviously a bad choice: iterating over the whole dataset is extremely costly because I have to open and close 100k files. But I’m unsure how to batch my data into larger files while also keeping track of which file contains which indices. And even if I did such a batching, how would my “get” function know to keep a file open until I’m done with it? Obviously there are many issues with what I’m suggesting, which is why I never went through with it.
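Concretely, the batching I have in mind would look something like this. This is only a sketch under my own assumptions: the chunk size and file naming are arbitrary, plain tensors stand in for PyG Data objects, and the caching uses `functools.lru_cache` so recently used chunk files stay in memory rather than being reopened on every access:

```python
import os
from functools import lru_cache

import torch

CHUNK_SIZE = 1000  # samples per file; would need tuning to the real sample size


def save_chunks(samples, root):
    """Write samples in groups of CHUNK_SIZE, one list per .pt file."""
    os.makedirs(root, exist_ok=True)
    for start in range(0, len(samples), CHUNK_SIZE):
        chunk_id = start // CHUNK_SIZE
        torch.save(samples[start:start + CHUNK_SIZE],
                   os.path.join(root, f"chunk_{chunk_id}.pt"))


class ChunkedStore:
    """Maps a global sample index to (chunk file, offset) and caches the
    most recently used chunks, so sequential iteration rarely reopens files."""

    def __init__(self, root, num_samples):
        self.root = root
        self.num_samples = num_samples
        # Per-instance cache: keeps the last few loaded chunk lists in memory.
        self._load_chunk = lru_cache(maxsize=4)(self._read_chunk)

    def _read_chunk(self, chunk_id):
        # Each chunk file holds a plain list of samples.
        return torch.load(os.path.join(self.root, f"chunk_{chunk_id}.pt"))

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        chunk_id, offset = divmod(idx, CHUNK_SIZE)
        return self._load_chunk(chunk_id)[offset]
```

In a real Dataset subclass, `__getitem__` would become `get(idx)`, but the index arithmetic and cache would be the same.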

So does anyone have any ideas of how I can deal with this issue?


Could you load data once per iteration rather than loading everything into memory when creating the Dataset?
We have a new repo that provides IterDataPipe to advocate an iterator-style data-loading methodology: GitHub - pytorch/data: A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
Let us know if this helps.
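For reference, the same iterator-style idea can be sketched with core PyTorch’s `IterableDataset` (which the DataPipes build on): stream samples from pre-chunked files so each file is opened exactly once per epoch. The `chunk_*.pt` layout here is hypothetical, and plain tensors stand in for graph samples:

```python
import glob
import os

import torch
from torch.utils.data import IterableDataset


class ChunkStream(IterableDataset):
    """Yields samples sequentially from chunked .pt files, one open per file.

    Assumes each file holds a list of samples. Note that a lexical sort of
    'chunk_10.pt' lands before 'chunk_2.pt', so zero-padded names
    (e.g. 'chunk_00002.pt') are safer with many chunks.
    """

    def __init__(self, root):
        self.paths = sorted(glob.glob(os.path.join(root, "chunk_*.pt")))

    def __iter__(self):
        for path in self.paths:
            for sample in torch.load(path):  # one read per chunk file
                yield sample
```

This composes with `torch.utils.data.DataLoader` (or PyG’s own DataLoader, which additionally handles graph batching) at the cost of giving up random access, which is fine if you shuffle at the chunk level.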

What I did was pass a train.csv containing the list of files I want to train with, and return this list in processed_file_names. In get(), I iterate through the list and read each of the .pt files. In my case there are 63k graph samples, and each .pt file can be as big as 5 MB, so I zipped each of the .pt files containing the ‘data’ instance.
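In code, that pattern looks roughly like this. The CSV column name and helper names are illustrative (not from the poster), and plain `torch.load` of tensor files stands in for reading the zipped PyG Data files:

```python
import csv
import os

import torch


def read_file_list(csv_path):
    """Read a train.csv whose 'filename' column lists the .pt files to use
    (the column name is an assumption; any single-column listing works)."""
    with open(csv_path, newline="") as f:
        return [row["filename"] for row in csv.DictReader(f)]


def get(root, file_names, idx):
    """Load the idx-th sample by opening its own .pt file, as in the
    one-file-per-sample approach described above."""
    return torch.load(os.path.join(root, file_names[idx]))
```

The file list doubles as the return value of processed_file_names, so PyG can check that processing has already been done.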

I tried an InMemoryDataset too, but no luck. Training has been extremely slow due to data loading (~40 minutes per epoch, the same on an RTX 3070 and an RTX A4000) with a batch size of 64 and 27k training samples.