PyTorch Geometric Creating your own dataset

Saguaro · February 8, 2021, 8:41pm

Hi!

I am new to PyTorch and I have one task: my objective is to load my personally collected data into PyTorch. I am working with the PyTorch Geometric extension library.

So I have some problems with understanding the following code:

import os.path as osp

import torch
from torch_geometric.data import Data, Dataset


class MyOwnDataset(Dataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyOwnDataset, self).__init__(root, transform, pre_transform)

    @property
    def raw_file_names(self): #Point 1
        return ['some_file_1', 'some_file_2', ...]

    @property
    def processed_file_names(self): #Point 2
        return ['data_1.pt', 'data_2.pt', ...]

    def download(self): #Point 3
        # Download to `self.raw_dir`.
        pass

    def process(self): #Point 4
        i = 0
        for raw_path in self.raw_paths:
            # Read data from `raw_path`.
            data = Data(...)

            if self.pre_filter is not None and not self.pre_filter(data):
                continue

            if self.pre_transform is not None:
                data = self.pre_transform(data)

            torch.save(data, osp.join(self.processed_dir, 'data_{}.pt'.format(i)))
            i += 1

    def len(self):
        return len(self.processed_file_names)

    def get(self, idx):
        data = torch.load(osp.join(self.processed_dir, 'data_{}.pt'.format(idx)))
        return data

I have marked the points I would like to discuss. So Point 1 and Point 2 are the names of the files I want to upload and then obtain.
But what exactly are we doing at Point 3, Point 4 and after them? I have also been trying to understand it by looking at similar source code, but I cannot get it.
Could somebody explain it to me in plain words? Some examples would be perfect.

Thank you in advance!

Regards

ptrblck · February 9, 2021, 8:58am

I’m not familiar with the real implementation of this dataset, but based on the function names I would assume this:

raw_file_names: returns a list with the file names of the “raw” input data, which has not been processed in any way yet
processed_file_names: returns the file names of the already processed files. Based on the file extension I assume these processed files are stored as PyTorch tensors (or at least via PyTorch).
download: downloads the raw files to self.raw_dir, most likely from a remote server.
process: iterates over all raw files, loads and processes them. Afterwards it stores the processed data to self.processed_dir.
len: returns the number of processed files
get: loads a processed file from the self.processed_dir using the passed idx and returns it.
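
To make that flow concrete, here is a toy sketch that uses only the Python standard library, so it runs without PyTorch Geometric installed. It is not the library's implementation: the class name ToyDataset, the file names, the made-up edge lists and the JSON storage format are all assumptions for illustration. The one thing it tries to mirror is the order of events: as far as I understand the base class, the constructor calls download() when raw files are missing and process() when processed files are missing, and afterwards len() and get() only read from the processed directory.

```python
import json
import os
import tempfile


class ToyDataset:
    """Stdlib-only stand-in that mimics the download -> process -> get flow."""

    def __init__(self, root):
        self.raw_dir = os.path.join(root, "raw")
        self.processed_dir = os.path.join(root, "processed")
        os.makedirs(self.raw_dir, exist_ok=True)
        os.makedirs(self.processed_dir, exist_ok=True)
        # Fetch the raw files only if at least one of them is missing.
        if not all(os.path.exists(p) for p in self.raw_paths):
            self.download()
        # Convert raw files only if at least one processed file is missing.
        if not all(os.path.exists(p) for p in self.processed_paths):
            self.process()

    @property
    def raw_file_names(self):  # Point 1: files expected in raw_dir
        return ["graph_0.txt", "graph_1.txt"]

    @property
    def processed_file_names(self):  # Point 2: files expected in processed_dir
        return ["data_0.json", "data_1.json"]

    @property
    def raw_paths(self):
        return [os.path.join(self.raw_dir, n) for n in self.raw_file_names]

    @property
    def processed_paths(self):
        return [os.path.join(self.processed_dir, n) for n in self.processed_file_names]

    def download(self):  # Point 3: put the raw files into raw_dir
        # A real dataset would fetch these from a server or copy local files.
        # Here we just write two made-up edge lists (hypothetical data).
        edge_lists = ["0 1\n1 2\n", "0 1\n"]
        for path, text in zip(self.raw_paths, edge_lists):
            with open(path, "w") as f:
                f.write(text)

    def process(self):  # Point 4: raw file -> one processed file per graph
        for i, raw_path in enumerate(self.raw_paths):
            with open(raw_path) as f:
                edges = [[int(v) for v in line.split()] for line in f if line.strip()]
            out_path = os.path.join(self.processed_dir, f"data_{i}.json")
            with open(out_path, "w") as f:
                json.dump({"edge_index": edges}, f)

    def len(self):
        return len(self.processed_file_names)

    def get(self, idx):
        with open(os.path.join(self.processed_dir, f"data_{idx}.json")) as f:
            return json.load(f)


root = tempfile.mkdtemp()
dataset = ToyDataset(root)       # first run: download() and process() both execute
print(dataset.len())
print(dataset.get(0))
dataset_again = ToyDataset(root)  # second run: files exist, so both steps are skipped
```

In the real class from the question, process() would build a Data object from each raw file and write it with torch.save, and get() would read it back with torch.load. If your data is already on disk, download() can simply do nothing, as long as the files listed in raw_file_names are placed in the raw directory beforehand.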