Is there a dataset functionality similar to PyTorch Geometric's preprocess skipping?


I’ve used PyTorch Geometric for a while and have now returned to plain PyTorch.
Something I haven’t found in PyTorch, and that I have grown very accustomed to in PyTorch Geometric, is the possibility to skip preprocessing on datasets after you’ve done it once.
I was wondering whether there is a similar functionality here that I’m just missing, or whether I have to do everything myself.

But what am I talking about exactly?
Well, there is a good example of it in their documentation:

import torch
from torch_geometric.data import InMemoryDataset

class MyOwnDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyOwnDataset, self).__init__(root, transform, pre_transform)
        # Load the already processed data from the processed directory.
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['some_file_1', 'some_file_2', ...]

    @property
    def processed_file_names(self):
        return ['']

    def download(self):
        # Download to `self.raw_dir`.
        pass

    def process(self):
        # Read data into huge `Data` list.
        data_list = [...]

        # Collate the list into one object and save it for later runs.
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])

So the interface provides two things: a directory for the raw data and a directory for the processed data.
The dataset runs the download method if the raw directory is empty.
Once the process method has run, the result is saved into the processed directory. If the data has already been preprocessed, it is simply loaded from that file every time I reload the dataset.

Is there a functionality like that?

Hi @Sascha, I believe you can use InMemoryDataset with vanilla PyTorch.

Map-style datasets in PyTorch only need three methods: __init__(), __len__() and __getitem__(). Since torch_geometric.data.Dataset provides all three, you should be able to pass its subclasses to a vanilla PyTorch DataLoader.
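
To illustrate the point, here is a minimal sketch of a dataset with just those three methods being handed to a plain DataLoader. The class name RangeDataset and its contents are made up for the example and are not part of either library.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RangeDataset(Dataset):
    """Hypothetical toy dataset: the numbers 0..n-1 as float tensors."""

    def __init__(self, n):
        self.values = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        # Number of samples in the dataset.
        return len(self.values)

    def __getitem__(self, index):
        # One sample; the default collate function stacks these into a batch.
        return self.values[index]


loader = DataLoader(RangeDataset(10), batch_size=4, shuffle=False)
batch_sizes = [batch.shape[0] for batch in loader]
```

The default batching handles tensors and numbers like these; samples of a custom type would likely need a collate_fn passed to the DataLoader.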


I’ll try that, thank you.
Although I had hoped there would be a way that doesn’t require me to install the PyTorch Geometric package into every environment, just because I like their dataset interface.
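
Without the extra dependency, the "process once, then load from disk" pattern could be sketched roughly as follows. This is only an assumption of how one might do it by hand, not an existing PyTorch feature: CachedDataset, SquaresDataset, process() and the data.pkl file name are all hypothetical, and pickle is used only to keep the sketch free of dependencies (torch.save and torch.load could take its place).

```python
import os
import pickle
import tempfile


class CachedDataset:
    """Hypothetical map-style dataset that preprocesses once and reloads afterwards."""

    def __init__(self, root):
        self.processed_path = os.path.join(root, "processed", "data.pkl")
        if not os.path.exists(self.processed_path):
            # First use: run the expensive step and store the result.
            os.makedirs(os.path.dirname(self.processed_path), exist_ok=True)
            with open(self.processed_path, "wb") as f:
                pickle.dump(self.process(), f)
        # Every use: load the stored result from disk.
        with open(self.processed_path, "rb") as f:
            self.samples = pickle.load(f)

    def process(self):
        # Subclasses return the list of preprocessed samples here.
        raise NotImplementedError

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]


class SquaresDataset(CachedDataset):
    process_calls = 0  # counts how often the expensive step actually runs

    def process(self):
        SquaresDataset.process_calls += 1
        return [i * i for i in range(5)]


root = tempfile.mkdtemp()
first = SquaresDataset(root)   # runs process() and writes the cache file
second = SquaresDataset(root)  # finds the cache file and skips process()
```

In a real project the base class would probably subclass torch.utils.data.Dataset, and the cache would need to be deleted by hand whenever the preprocessing code changes.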