Dataset wrapper zoo

A lot of my time on any research project is spent building Dataset subclasses to wrap publicly available datasets. Most of the “official implementation of *” repositories also ship their own wrappers for the same datasets that do basically the same thing, but each time you have to figure out exactly which transformations are applied, how the data is sampled, etc. It’s repetitive and probably the least appealing part of the work.

PyTorch already has a generally standardized way of implementing dataset classes and transforms, so creating a single repository where people contribute wrappers for publicly available datasets doesn’t seem like a stretch. Even if a particular project needs data packaged differently, it’s still far better to have a starting point.
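As a sketch of that standardized interface: a map-style dataset only needs `__len__` and `__getitem__` (the class name and the `samples`/`transform` fields here are illustrative, not from any particular repo):

```python
class PairsDataset:
    """Minimal map-style dataset following PyTorch's Dataset protocol.

    Any object with __len__ and __getitem__ can be fed to a DataLoader;
    subclassing torch.utils.data.Dataset is conventional but not required.
    """

    def __init__(self, samples, transform=None):
        self.samples = samples        # list of (data, label) pairs
        self.transform = transform    # optional per-sample callable

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        data, label = self.samples[idx]
        if self.transform is not None:
            data = self.transform(data)
        return data, label
```

A wrapper zoo would mostly be a collection of classes like this one, each encapsulating the download/extraction quirks of its dataset.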

I don’t think I’m the first one to come up with this idea, so my question is: why isn’t a Dataset Wrapper Zoo a thing yet?

torchvision.datasets contains a lot of public datasets as well as wrappers for common datasets such as ImageNet.
You could of course contribute new datasets to it.

I’m not sure what kind of data you are using, but if it’s not a vision dataset, you might want to contribute to e.g. torchtext.

As far as I understand datasets provided in torchvision.datasets are supposed to be available for setup in a single function call.
What about datasets that require you to first register for access, then download a bunch of archives using your credentials and extract them?
Such datasets need something like a short readme with installation instructions and sometimes even have their own dependencies.
Are these suitable for torchvision.datasets?

I think they might still be suitable, if they make your life easier once you have the dataset.
ImageNet is such an example.
There was a period where you could download the dataset directly, but if I’m not mistaken, the download link was taken down, so you now need to register in order to download it.

OK, thank you.
There’s currently an unresolved issue about how to best structure dataset wrappers, so that is probably what creates friction for potential contributors.

Could you post a link to the issue?
Also, maybe your idea of a community-based dataset zoo could fit in the torch hub?

CC @ailzhang


So far it seems that reaching consensus has proven difficult, as both the datasets and the tasks they’re used for vary a lot.
IMO a good starting point would be to first collect a bunch of contributed datasets somewhere, possibly torch hub as you’ve suggested, to get a feel for how they’re used. After that it would be easier to create a robust abstraction.
Right now every research repo has its own Datasets and custom DataLoaders, which are often a pain to work with when you try to combine datasets from several different repos.

Yes, that’s true, and maintaining/creating a dataset wrapper can also be quite a lot of work.
What is your feeling about the number of stale datasets “in the wild”?

A LOT of datasets still only have Matlab code samples, so yeah…