Custom multiprocessing data generation


I’m trying to figure out how to create a custom dataloader with the following capabilities:

  1. It should take a list of datasets and, for each dataset, a number of workers.
  2. When generating a batch, it should run data generation in parallel, using the specified number of workers for each dataset.
  3. It should create child processes (workers) just once and reuse them for the loader’s whole lifetime.
  4. It should support samplers, like the usual DataLoader does.

My dataset classes generate data in different ways: one is fast (it just loads images from disk), while the other is slower (it uses a GAN on the GPU to generate samples), so I need to dedicate a different number of workers to each dataset. And since the second dataset carries a heavy GAN, its workers should be spawned just once.

For example, suppose we have:

datasets = [FileDataset(), GanDataset()]
n_workers = [2, 4]
custom_dl = CustomDataloader(datasets, n_workers)

Every time next() is called on the dataloader’s iterator, the dataloader should send asynchronous generation requests to each dataset’s workers and wait until all of them have finished.
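To make the request/wait step concrete, here is a minimal illustration of the pattern I mean, using plain multiprocessing.Pool and apply_async (slow_square is just a made-up stand-in for a dataset’s per-sample generation):

```python
import multiprocessing as mp


def slow_square(i):
    # Stand-in for per-sample generation (a file read or a GAN forward pass).
    return i * i


pool_fast = mp.Pool(2)  # workers for the fast dataset
pool_slow = mp.Pool(4)  # workers for the slow dataset

# Dispatch to both pools without blocking, then wait for all results.
fast_jobs = [pool_fast.apply_async(slow_square, (i,)) for i in range(4)]
slow_jobs = [pool_slow.apply_async(slow_square, (i,)) for i in range(4)]
batch = [j.get() for j in fast_jobs] + [j.get() for j in slow_jobs]
print(batch)  # → [0, 1, 4, 9, 0, 1, 4, 9]

for p in (pool_fast, pool_slow):
    p.close()
    p.join()
```

The point is that both pools work on their parts of the batch concurrently, and the `.get()` calls are where the loader blocks until everything is done.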

So, what is a good way to implement a dataloader with this logic?
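For concreteness, here is a rough sketch of the direction I’m considering, built on persistent multiprocessing.Pool objects (one per dataset, created once and reused). CustomDataloader and SquareDataset are names I made up for illustration; I’ve hardcoded a sequential sampler and skipped collation, and this assumes each dataset is picklable and indexable:

```python
import multiprocessing as mp


class SquareDataset:
    """Toy stand-in for FileDataset/GanDataset; real ones would do I/O or GPU work."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        return i * i


class CustomDataloader:
    def __init__(self, datasets, n_workers, batch_size=4):
        self.datasets = datasets
        self.batch_size = batch_size
        # Pools are created once and reused for every batch (capability 3).
        self.pools = [mp.Pool(processes=n) for n in n_workers]

    def __iter__(self):
        # Sequential sampler for brevity; any sampler could be plugged in here.
        samplers = [iter(range(len(d))) for d in self.datasets]
        while True:
            batch = []
            for ds, pool, sampler in zip(self.datasets, self.pools, samplers):
                try:
                    idxs = [next(sampler) for _ in range(self.batch_size)]
                except StopIteration:
                    return  # a dataset is exhausted; stop iteration
                # Async fan-out: each pool generates its dataset's samples in parallel.
                jobs = [pool.apply_async(ds.__getitem__, (i,)) for i in idxs]
                batch.append([j.get() for j in jobs])  # block until this dataset is done
            yield batch  # one sub-batch per dataset

    def close(self):
        for p in self.pools:
            p.close()
            p.join()


loader = CustomDataloader([SquareDataset(8), SquareDataset(8)],
                          n_workers=[2, 4], batch_size=4)
batches = list(loader)
loader.close()
```

One open question with this sketch is that `.get()` waits for one dataset’s jobs before dispatching the next dataset’s, so the fan-out across datasets isn’t fully concurrent; dispatching all jobs first and gathering afterwards would fix that.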