PyTorch Geometric: Creating your own dataset


I am new to PyTorch and have one task: I want to load data that I collected myself into PyTorch. I am working with the PyTorch Geometric extension library.

I am having trouble understanding the following code:

import os.path as osp

import torch
from torch_geometric.data import Data, Dataset

class MyOwnDataset(Dataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyOwnDataset, self).__init__(root, transform, pre_transform)

    @property
    def raw_file_names(self): #Point 1
        return ['some_file_1', 'some_file_2', ...]

    @property
    def processed_file_names(self): #Point 2
        return ['data_0.pt', 'data_1.pt', ...]

    def download(self): #Point 3
        # Download to `self.raw_dir`.
        pass

    def process(self): #Point 4
        i = 0
        for raw_path in self.raw_paths:
            # Read data from `raw_path`.
            data = Data(...)

            if self.pre_filter is not None and not self.pre_filter(data):
                continue

            if self.pre_transform is not None:
                data = self.pre_transform(data)

            torch.save(data, osp.join(self.processed_dir, 'data_{}.pt'.format(i)))
            i += 1

    def len(self):
        return len(self.processed_file_names)

    def get(self, idx):
        data = torch.load(osp.join(self.processed_dir, 'data_{}.pt'.format(idx)))
        return data

I have marked the points I would like to discuss. As I understand it, Point 1 and Point 2 are the names of the files I want to load and then obtain.
But what exactly are we doing at Point 3, Point 4, and after them? I have also tried to understand it by looking at similar source code, but I still don't get it.
Could somebody explain it to me in plain words? Some examples would be perfect.

Thank you in advance!


I’m not familiar with the real implementation of this dataset, but based on the function names I would assume this:

  • raw_file_names: returns a list with the file names of the “raw” input data, which has not yet been processed in any way.
  • processed_file_names: returns the file names of the already processed files. Based on the file extension, I assume these processed files are stored as PyTorch tensors (or at least via PyTorch).
  • download: downloads the raw files to self.raw_dir, most likely from a remote server.
  • process: iterates over all raw files, loads and processes them, and then stores the processed data in self.processed_dir.
  • len: returns the number of processed files.
  • get: loads a processed file from self.processed_dir using the passed idx and returns it.
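
To make the flow above concrete, here is a plain-Python sketch that imitates these steps. It is not PyG code: the class name TinyDataset, the file names, and the use of pickle instead of torch.save are all made up for illustration. It assumes that raw_dir and processed_dir are the "raw" and "processed" subfolders of root, and that download and process are only called when the expected files are missing, which is my understanding of how the PyG base class behaves.

```python
import os
import os.path as osp
import pickle
import tempfile


class TinyDataset:
    """Plain-Python imitation of the raw -> processed lifecycle (not PyG code)."""

    def __init__(self, root):
        self.raw_dir = osp.join(root, "raw")
        self.processed_dir = osp.join(root, "processed")
        os.makedirs(self.raw_dir, exist_ok=True)
        os.makedirs(self.processed_dir, exist_ok=True)
        # Fetch the raw files only if at least one of them is missing.
        if not all(osp.exists(osp.join(self.raw_dir, f)) for f in self.raw_file_names):
            self.download()
        # Convert raw files only if at least one processed file is missing.
        if not all(osp.exists(osp.join(self.processed_dir, f)) for f in self.processed_file_names):
            self.process()

    @property
    def raw_file_names(self):
        # Hypothetical raw files expected inside raw_dir.
        return ["a.txt", "b.txt"]

    @property
    def processed_file_names(self):
        # Hypothetical processed files expected inside processed_dir.
        return ["data_0.pkl", "data_1.pkl"]

    def download(self):
        # Stand-in for a real download: write small text files into raw_dir.
        for k, name in enumerate(self.raw_file_names):
            with open(osp.join(self.raw_dir, name), "w") as f:
                f.write(" ".join(str(k + j) for j in range(3)))

    def process(self):
        # Parse each raw file and save one processed object per raw file.
        for i, name in enumerate(self.raw_file_names):
            with open(osp.join(self.raw_dir, name)) as f:
                values = [int(tok) for tok in f.read().split()]
            with open(osp.join(self.processed_dir, f"data_{i}.pkl"), "wb") as f:
                pickle.dump(values, f)

    def len(self):
        return len(self.processed_file_names)

    def get(self, idx):
        # Load the processed object with the given index from disk.
        with open(osp.join(self.processed_dir, f"data_{idx}.pkl"), "rb") as f:
            return pickle.load(f)


root = tempfile.mkdtemp()
dataset = TinyDataset(root)   # triggers download() and then process()
first = dataset.get(0)        # parsed contents of a.txt
second = dataset.get(1)       # parsed contents of b.txt
```

If your data is already on disk, the equivalent of download here could simply do nothing, as long as the files listed in raw_file_names are already placed in the raw folder. Creating TinyDataset(root) a second time with the same root would skip both steps, because the files are found.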