DataLoader, IterableDataset, and parallelization

I’m using a generator wrapped in an IterableDataset and passed to a DataLoader. The data is read on demand from disk because there are thousands of files, so the IterableDataset has some logic to keep track of the open files, the cursor into each file, and some funky computations. If I want multiple workers to prefetch data in the DataLoader, I understand these are created as separate processes, so each worker will iterate the full dataset and the data will be read multiple times per epoch. Is this the right understanding?

import numpy as np
from torch.utils.data import DataLoader, IterableDataset

class Dataset(IterableDataset):
    def __init__(self):
        # Toy stand-in for the real file-backed data.
        self.x = np.random.rand(100, 2)
        self.y = np.random.rand(100, 1)

    def __iter__(self):
        # generator() already returns an iterator, so no extra iter() is needed
        return self.generator()

    def generator(self):
        for x, y in zip(self.x, self.y):
            yield x, y

and

def main():
    train = Dataset()

    # Iterating the dataset directly yields each of the 100 samples once.
    count = 0
    for batch in train:
        count += 1

    print(f'count={count}')

    # Single-process DataLoader: still a single pass over the data.
    train_loader = DataLoader(dataset=train, batch_size=3)

    count = 0
    for batch in train_loader:
        count += batch[0].shape[0]

    print(f'count={count}')

    # Three worker processes, each with its own copy of the dataset.
    train_loader = DataLoader(dataset=train, batch_size=3, prefetch_factor=10, num_workers=3)

    count = 0
    for batch in train_loader:
        count += batch[0].shape[0]

    print(f'count={count}')


if __name__ == '__main__':
    main()

prints the following:

count=100
count=100
count=300

So each of the 3 workers iterates all 100 samples, which is where count=300 comes from. Any recommendations on how to prevent or fix this?

The docs show an example that uses get_worker_info() inside __iter__ to avoid having duplicate data returned from all the workers.
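For reference, here is a minimal sketch of that approach applied to the toy Dataset above. The name ShardedDataset and the strided split are my own choices (the docs example splits a contiguous [start, end) range instead), but the mechanism is the same: get_worker_info() returns None in the main process and per-worker info inside a worker, so each worker can pick a disjoint slice of the data.

import numpy as np
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedDataset(IterableDataset):
    def __init__(self):
        self.x = np.random.rand(100, 2)
        self.y = np.random.rand(100, 1)

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: yield everything.
            offset, stride = 0, 1
        else:
            # Worker i of n takes samples i, i+n, i+2n, ... so the
            # shards are disjoint and together cover the whole dataset.
            offset, stride = info.id, info.num_workers
        for x, y in zip(self.x[offset::stride], self.y[offset::stride]):
            yield x, y

With this change the third loop prints count=100 again (the three workers yield 34 + 33 + 33 samples). For the real file-backed dataset the same idea would presumably apply to the file list instead, e.g. each worker only opening files[info.id::info.num_workers]; and since every worker holds its own copy of the dataset object, the open-file handles and cursors should be created lazily inside __iter__ rather than in __init__.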