Custom multiprocessing data generation


I’m trying to figure out how to create a custom dataloader with the following capabilities:

  1. It should take a list of datasets and, for each dataset, a number of workers.
  2. When generating a batch, it should run data generation in parallel, using the specified number of workers for each dataset.
  3. It should create child processes (workers) just once and reuse them for the loader’s whole lifetime.
  4. It should support samplers, like the usual DataLoader does.

My dataset classes generate data in different ways: one is fast (it just loads images from disk), while the other is slower (it uses a GAN on the GPU to generate samples), so I need to dedicate a different number of workers to each dataset. And since the second dataset carries a heavy GAN, its workers should be spawned just once.

For example, suppose we have:

datasets = [FileDataset(), GanDataset()]
n_workers = [2, 4]
custom_dl = CustomDataloader(datasets, n_workers)

Every time next() is called on the dataloader’s iterator, the dataloader should send asynchronous generation requests to each dataset’s workers and wait until all of them have finished.
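To make the request/wait step concrete, here is a minimal illustration of the pattern I mean, using plain multiprocessing.Pool and apply_async (slow_square is just a made-up stand-in for a dataset’s per-sample generation):

```python
import multiprocessing as mp


def slow_square(i):
    # Stand-in for per-sample generation (a file read or a GAN forward pass).
    return i * i


pool_fast = mp.Pool(2)  # workers for the fast dataset
pool_slow = mp.Pool(4)  # workers for the slow dataset

# Dispatch to both pools without blocking, then wait for all results.
fast_jobs = [pool_fast.apply_async(slow_square, (i,)) for i in range(4)]
slow_jobs = [pool_slow.apply_async(slow_square, (i,)) for i in range(4)]
batch = [j.get() for j in fast_jobs] + [j.get() for j in slow_jobs]
print(batch)  # → [0, 1, 4, 9, 0, 1, 4, 9]

for p in (pool_fast, pool_slow):
    p.close()
    p.join()
```

The point is that both pools work on their parts of the batch concurrently, and the `.get()` calls are where the loader blocks until everything is done.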

So, what is a good way to implement a dataloader with this logic?
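For concreteness, here is a rough sketch of the direction I’m considering, built on persistent multiprocessing.Pool objects (one per dataset, created once and reused). CustomDataloader and SquareDataset are names I made up for illustration; I’ve hardcoded a sequential sampler and skipped collation, and this assumes each dataset is picklable and indexable:

```python
import multiprocessing as mp


class SquareDataset:
    """Toy stand-in for FileDataset/GanDataset; real ones would do I/O or GPU work."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        return i * i


class CustomDataloader:
    def __init__(self, datasets, n_workers, batch_size=4):
        self.datasets = datasets
        self.batch_size = batch_size
        # Pools are created once and reused for every batch (capability 3).
        self.pools = [mp.Pool(processes=n) for n in n_workers]

    def __iter__(self):
        # Sequential sampler for brevity; any sampler could be plugged in here.
        samplers = [iter(range(len(d))) for d in self.datasets]
        while True:
            batch = []
            for ds, pool, sampler in zip(self.datasets, self.pools, samplers):
                try:
                    idxs = [next(sampler) for _ in range(self.batch_size)]
                except StopIteration:
                    return  # a dataset is exhausted; stop iteration
                # Async fan-out: each pool generates its dataset's samples in parallel.
                jobs = [pool.apply_async(ds.__getitem__, (i,)) for i in idxs]
                batch.append([j.get() for j in jobs])  # block until this dataset is done
            yield batch  # one sub-batch per dataset

    def close(self):
        for p in self.pools:
            p.close()
            p.join()


loader = CustomDataloader([SquareDataset(8), SquareDataset(8)],
                          n_workers=[2, 4], batch_size=4)
batches = list(loader)
loader.close()
```

One open question with this sketch is that `.get()` waits for one dataset’s jobs before dispatching the next dataset’s, so the fan-out across datasets isn’t fully concurrent; dispatching all jobs first and gathering afterwards would fix that.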