Is inheriting from Dataset necessary for the creation of a custom dataset?

kuzand · February 26, 2020, 6:27am

I am creating my own dataset class which takes a path to a csv file containing image paths and labels:

class MyDataset():
    def __init__(self, csv_path):
        ...

    def __len__(self):
        return len(...)

    def __getitem__(self, index):
        ...
        return (img, label)

As you can see, it implements the methods __init__, __getitem__ and __len__.

Now, this is the source code for PyTorch’s Dataset:

class Dataset(object):
    def __getitem__(self, index):
        raise NotImplementedError

    def __add__(self, other):
        return ConcatDataset([self, other])

and the docs say:

All datasets that represent a map from keys to data samples should subclass it

But I don’t see anything special except the __add__ method, which I don’t think I need in my case (otherwise I could write my own). Having implemented my own __getitem__ and __len__, is it still necessary to inherit from Dataset in order to create a DataLoader later on? What is the advantage of subclassing it?

tom · February 26, 2020, 7:30am

Will it work without subclassing Dataset? Probably in most cases.
Do you get anything from skipping subclassing Dataset? Probably not.
Will it bite you when third-party code tests isinstance(ds, Dataset) to check whether it got a dataset, or when someone uses type hints with your code? Yes.

Best regards

Thomas
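
To make the isinstance point concrete, here is a minimal sketch in plain Python. It assumes a stand-in Dataset class that copies only the __getitem__ part of the source quoted above (with the real library you would use torch.utils.data.Dataset instead). DuckDataset, SubclassedDataset, describe and the sample file names are all hypothetical names made up for illustration.

```python
class Dataset:
    """Stand-in for torch.utils.data.Dataset (assumption: only the quoted
    __getitem__ stub is reproduced here, so the sketch runs without PyTorch)."""

    def __getitem__(self, index):
        raise NotImplementedError


class DuckDataset:
    """Implements __len__ and __getitem__ but does NOT inherit from Dataset."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]


class SubclassedDataset(DuckDataset, Dataset):
    """Same behaviour as DuckDataset, but also registered as a Dataset."""


def describe(ds: Dataset) -> int:
    """Hypothetical third-party helper that validates its input by type."""
    if not isinstance(ds, Dataset):
        raise TypeError(f"expected a Dataset, got {type(ds).__name__}")
    return len(ds)


# Hypothetical (image path, label) pairs.
samples = [("a.png", 0), ("b.png", 1)]
duck = DuckDataset(samples)
sub = SubclassedDataset(samples)

# Indexing and len() behave identically for both objects.
print(duck[0] == sub[0], len(duck) == len(sub))

# Only the subclass passes the type check.
print(isinstance(duck, Dataset), isinstance(sub, Dataset))

# The helper accepts the subclass but rejects the duck-typed object.
print(describe(sub))
try:
    describe(duck)
except TypeError as exc:
    print("rejected:", exc)
```

Both objects behave the same when indexed, so code that only calls len() and [] cannot tell them apart. The difference shows up only in code that checks the type explicitly, and a static type checker would likewise flag describe(duck).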

vedal · June 1, 2022, 9:25am

Does subclassing from Dataset give any performance gains compared to skipping subclassing?

tom · June 1, 2022, 9:38am

The performance difference should be negligible, if measurable at all.

Best regards

Thomas

vedal · June 1, 2022, 11:25am

Thank you for the swift and clear answer @tom