Is there a dataset functionality similar to PyTorch Geometric's preprocess skipping?

Sascha · September 23, 2020, 8:14am

Hi

I've used PyTorch Geometric for a while and have now returned to plain PyTorch.
One thing I've grown very accustomed to in PyTorch Geometric, and haven't found in PyTorch, is the ability to skip a dataset's preprocessing once it has been done.
I was wondering whether there is similar functionality here that I'm just missing, or whether I have to implement everything myself.

What exactly am I talking about?
Their documentation has a good example of it:

import torch
from torch_geometric.data import InMemoryDataset


class MyOwnDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyOwnDataset, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['some_file_1', 'some_file_2', ...]

    @property
    def processed_file_names(self):
        return ['data.pt']

    def download(self):
        # Download to `self.raw_dir`.
        pass

    def process(self):
        # Read data into huge `Data` list.
        data_list = [...]

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])

Here the interface provides two things: a directory for the raw data and a directory for the processed data.
The dataset runs the download method if the raw files are not yet present in the raw directory.
Once the process method has run, the result is saved to the processed directory, so if the data has already been preprocessed, it is simply loaded from that file every time I reload the dataset.

Is there a functionality like that?

suraj.pt · September 23, 2020, 2:11pm

Hi @Sascha, I believe you can use InMemoryDataset with vanilla PyTorch.

Datasets in PyTorch require only three methods: __init__(), __len__(), and __getitem__(). Since torch_geometric.data.Dataset provides all three, you should be able to pass its subclasses to a vanilla PyTorch DataLoader.
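
To illustrate the three-method contract, here is a minimal sketch of a map-style dataset passed to a standard DataLoader. The class name RangeDataset and its contents are made up for this example. It only shows that plain tensors batch with the default collation; whether PyTorch Geometric's Data objects batch the same way is not tested here.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RangeDataset(Dataset):
    """Hypothetical toy dataset holding the values 0..n-1."""

    def __init__(self, n):
        self.values = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        # Number of samples the DataLoader can index.
        return len(self.values)

    def __getitem__(self, idx):
        # Return a single sample; the DataLoader stacks samples into batches.
        return self.values[idx]


loader = DataLoader(RangeDataset(10), batch_size=4)
batch_sizes = [batch.shape[0] for batch in loader]
```

With 10 samples and a batch size of 4, the last batch is smaller than the others because drop_last defaults to False.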

Sascha · September 24, 2020, 11:22am

I’ll try that, thank you.
Although I had hoped there would be a way to do this without installing the PyTorch Geometric package in every environment just because I like its dataset interface.
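
For reference, the process-once, load-afterwards pattern can be sketched in plain PyTorch without PyTorch Geometric. This is a rough illustration under stated assumptions, not an existing PyTorch API: the names CachedDataset and SquaresDataset, the processed/data.pt layout, and the dict-of-tensors format are all invented for the example.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset


class CachedDataset(Dataset):
    """Hypothetical base class: runs process() once, then reloads the saved file."""

    def __init__(self, root):
        self.processed_path = os.path.join(root, "processed", "data.pt")
        if not os.path.exists(self.processed_path):
            # First use: preprocess and write the result to disk.
            os.makedirs(os.path.dirname(self.processed_path), exist_ok=True)
            torch.save(self.process(), self.processed_path)
        # Every use: load the preprocessed tensors from disk.
        self.data = torch.load(self.processed_path)

    def process(self):
        # Subclasses return a dict with tensors under "x" and "y".
        raise NotImplementedError

    def __len__(self):
        return self.data["x"].shape[0]

    def __getitem__(self, idx):
        return self.data["x"][idx], self.data["y"][idx]


class SquaresDataset(CachedDataset):
    calls = 0  # counts how often process() actually runs

    def process(self):
        SquaresDataset.calls += 1
        x = torch.arange(10, dtype=torch.float32)
        return {"x": x, "y": x ** 2}


root = tempfile.mkdtemp()
first = SquaresDataset(root)   # preprocesses and writes the cache
second = SquaresDataset(root)  # only loads the cached file
```

The second construction skips process() because the processed file already exists. Unlike the PyTorch Geometric interface, this sketch has no download step and does not detect a stale cache when the preprocessing code changes.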