Dataset map-style vs iterable-style

aerinykim · August 10, 2020, 7:57pm

A map-style dataset in PyTorch implements __getitem__() and __len__(), while an iterable-style dataset implements the __iter__() protocol. With a map-style dataset we can access the data with dataset[idx], which is great; with an iterable-style dataset we can’t.
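As a rough illustration of the two protocols described above, here is a minimal sketch. The class names SquaresMap and SquaresIterable and the toy "squares" data are made up for this example; only Dataset and IterableDataset come from torch.utils.data.

```python
from torch.utils.data import Dataset, IterableDataset


class SquaresMap(Dataset):
    """Map-style (hypothetical example): random access by index plus a known length."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        return idx * idx


class SquaresIterable(IterableDataset):
    """Iterable-style (hypothetical example): samples are only produced in order."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        for i in range(self.n):
            yield i * i


map_ds = SquaresMap(5)
iter_ds = SquaresIterable(5)

third = map_ds[3]          # random access works on the map-style dataset
in_order = list(iter_ds)   # the iterable-style dataset is consumed sequentially

# Indexing the iterable-style dataset fails because __getitem__ is not implemented.
try:
    iter_ds[3]
    indexing_supported = True
except (TypeError, NotImplementedError):
    indexing_supported = False
```

Both classes expose the same data; the only difference is which access pattern they promise to support.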

My question is: why was this distinction necessary? What makes random reads of the data so expensive, or even impossible?

ptrblck · August 12, 2020, 3:23am

My understanding of the main difference is that IterableDataset provides a clean way to yield data from, e.g., a stream, where the length of the dataset is unknown or cannot be easily calculated. Inside the __iter__ method you would thus have to make sure to exit the iteration at some point, e.g. when your data stream is exhausted.
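A small sketch of the streaming case, under some assumptions: the LineStream class is hypothetical, and an in-memory io.StringIO stands in for a real stream (a socket, a pipe, a large file) whose length is not known up front. The point is the explicit exit from __iter__ once the stream has nothing left.

```python
import io

from torch.utils.data import DataLoader, IterableDataset


class LineStream(IterableDataset):
    """Hypothetical dataset: yields integers parsed from a text stream of unknown length."""

    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        while True:
            line = self.stream.readline()
            if line == "":   # readline() returns "" once the stream is exhausted
                return       # exit the iteration, otherwise the loop never ends
            yield int(line)


# Stand-in for a real data stream (assumption made for this example).
fake_stream = io.StringIO("1\n2\n3\n4\n5\n")

# No __len__ is needed; the DataLoader just batches whatever __iter__ yields.
loader = DataLoader(LineStream(fake_stream), batch_size=2)
batches = [batch.tolist() for batch in loader]
```

Note that nothing here ever asks for the number of samples or for sample number idx, which is why no __len__ or __getitem__ is required.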

shartzog · August 12, 2020, 4:46am

It’s pretty much the same distinction as between a generator (iterable-style) and a list (map-style), correct?

ptrblck · August 12, 2020, 6:31am

That’s my understanding, yes.
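For reference, the generator vs. list analogy can be sketched in plain Python (no PyTorch involved; squares_gen is a made-up helper for this example):

```python
def squares_gen(n):
    """Generator: values are produced lazily, one at a time, in order."""
    for i in range(n):
        yield i * i


as_list = [i * i for i in range(5)]   # "map-style": fully materialized
as_gen = squares_gen(5)               # "iterable-style": produced on demand

list_len = len(as_list)    # a list knows its length
list_item = as_list[3]     # and supports random access

# A generator supports neither len() nor indexing.
try:
    len(as_gen)
    gen_has_len = True
except TypeError:
    gen_has_len = False

try:
    as_gen[3]
    gen_indexable = True
except TypeError:
    gen_indexable = False

first_pass = list(as_gen)    # consumes the generator
second_pass = list(as_gen)   # empty: the generator object is exhausted
```

The analogy is loose in one respect: a generator object can be consumed only once, whereas an IterableDataset can usually be iterated again, since each call to its __iter__ can start a fresh pass over the data.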