[Performance] Map-style vs iterable-style dataset in PyTorch for remote storage

rn91 · September 24, 2020, 9:35pm

A map-style dataset in PyTorch implements __getitem__() and __len__(), while an iterable-style dataset implements __iter__(). From what I understand, map-style is used when you know the dataset length and iterable-style is used when you do not.
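To make the distinction concrete, here is a minimal sketch of the two styles as I understand them. The class names (MapStyleImages, StreamedImages) and the file names are made up for illustration, and the samples are just path strings rather than decoded images. The try/except fallback is only there so the snippet also runs on a machine without PyTorch installed.

```python
try:
    from torch.utils.data import Dataset, IterableDataset
except ImportError:
    # Fallback so the sketch still runs without PyTorch installed.
    Dataset = IterableDataset = object


class MapStyleImages(Dataset):
    """Map-style: random access by index, length known up front."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # A real dataset would fetch and decode the file here.
        return self.paths[index]


class StreamedImages(IterableDataset):
    """Iterable-style: samples are produced one after another."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __iter__(self):
        for path in self.paths:
            # A real dataset would read the next sample from the stream here.
            yield path


# Hypothetical file names, for illustration only.
paths = [f"img_{i}.jpg" for i in range(4)]
map_ds = MapStyleImages(paths)
iter_ds = StreamedImages(paths)
```

Both objects can be passed to torch.utils.data.DataLoader. As far as I can tell, the practical difference is that a sampler can pick arbitrary indices from map_ds, while iter_ds only hands out samples in the order its __iter__ produces them.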

I have a use case where I want to read data from a remote storage device that holds different types of datasets: 1) a bunch of tar files, 2) a bunch of zip files, and 3) a bunch of small images (100K). I am looking into writing a custom dataset but am confused about which style to use. Should I use a map-style or an iterable-style dataset? Which is more commonly used?

I believe IterableDataset has the drawback that you cannot use DistributedSampler with it, but it also has the advantage that you can stream data directly without fetching each object from a list by index.
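As a rough sketch of the streaming side, and of one possible way around the missing DistributedSampler, the snippet below reads a tar archive as a forward-only stream and keeps every world_size-th member for a given rank. ShardedTarStream, make_demo_tar, and the rank/world_size constructor arguments are names I made up for illustration; the archive is built in memory here, whereas a real one would come from remote storage. This is an assumption about how the sharding could be done by hand, not something PyTorch provides for iterable datasets.

```python
import io
import itertools
import tarfile

try:
    from torch.utils.data import IterableDataset
except ImportError:
    # Fallback so the sketch still runs without PyTorch installed.
    IterableDataset = object


def make_demo_tar(num_files):
    """Build a small in-memory tar archive standing in for a remote file."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for i in range(num_files):
            payload = f"sample-{i}".encode()
            info = tarfile.TarInfo(name=f"{i}.txt")
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


class ShardedTarStream(IterableDataset):
    """Stream tar members in order, keeping only this rank's share."""

    def __init__(self, tar_bytes, rank, world_size):
        self.tar_bytes = tar_bytes
        self.rank = rank              # hypothetical: index of this process
        self.world_size = world_size  # hypothetical: total number of processes

    def __iter__(self):
        # Mode "r|" reads the archive sequentially, without seeking backwards.
        with tarfile.open(fileobj=io.BytesIO(self.tar_bytes), mode="r|") as archive:
            members = (m for m in archive if m.isfile())
            # Keep members rank, rank + world_size, rank + 2 * world_size, ...
            for member in itertools.islice(members, self.rank, None, self.world_size):
                yield member.name, archive.extractfile(member).read()


tar_bytes = make_demo_tar(5)
shard0 = list(ShardedTarStream(tar_bytes, rank=0, world_size=2))
shard1 = list(ShardedTarStream(tar_bytes, rank=1, world_size=2))
```

In a real distributed job the rank and world size would presumably come from torch.distributed.get_rank() and torch.distributed.get_world_size(). With num_workers > 0 in the DataLoader, each worker gets its own copy of the dataset, so the stream would also need to be split per worker using torch.utils.data.get_worker_info() to avoid duplicated samples.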

Does an iterable-style dataset offer any other advantages over map-style in my specific case?
I also looked at whether I can use webdataset for my use case (https://github.com/pytorch/pytorch/issues/38419). I am wondering why it uses IterableDataset instead of map-style, given that IterableDataset does not work with DistributedSampler.

Which of the two will be better in terms of performance?

Can someone suggest? Thanks in advance!

@ptrblck @albanD