PyG dataloader with previously stored data

AleTL · February 18, 2022, 7:15pm

Hello everyone!

I’m working with a really huge dataset of small graphs. So far I’ve created a Dataset class that stores the torch_geometric.data.Data() graph objects in a processed folder. What I want is to read those files with a Dataset class without generating them again. How can I do it? It seems pretty silly, but I’m wasting a lot of time and can’t find a way.

The idea is to generate the Data() objects if they haven’t been generated before and just read them if they already exist. I’m trying to do this by creating a Dataset without download and process, but it doesn’t work.

Thank you!

ptrblck · February 19, 2022, 7:45am

I don’t know which format is used to store the generated torch_geometric data, but I assume you can load each sample with torch_geometric in a similar way.
If so, I would guess you could write a custom Dataset as described here and load each sample in the __getitem__.

AleTL · February 21, 2022, 8:42am

Hi!

Thank you for your answer! I’ve tried that, but it doesn’t work because a standard PyTorch Dataset needs just tensor data; it’s not able to handle the torch_geometric.data.Data() format.
I’m now trying to do it with a PyG Dataset, which can obviously handle PyG Data objects.

Thanks!

AleTL · February 22, 2022, 1:13pm

In case someone is curious, what I finally did was load the graph files without any Dataset, just as shown in the PyG tutorial.

```python
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

data_list = [Data(…), …, Data(…)]
loader = DataLoader(data_list, batch_size=32)
```