I have implemented an `IterableDataset` that generates synthetic data from a NumPy random generator. I need the generation process to be reproducible, so I seed the generator manually. My code looks like:
```python
import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader

class RandDataset(IterableDataset):
    def __init__(self, seed=42):
        super().__init__()
        self.seed = seed
        self.data_rng = np.random.default_rng(seed)

    def __iter__(self):
        while True:
            yield self.data_rng.normal()

def collect_data(loader, epochs=10):
    DATA = []
    for idx, data in enumerate(loader):
        if idx == epochs:
            break
        DATA.append(data)
    return DATA

DATA0 = collect_data(DataLoader(RandDataset(seed=42), batch_size=32))
```
My actual data-generating procedure is much more CPU-intensive than self.data_rng.normal(). To speed up data generation, I am using a multiprocessing data loader. However, this generates duplicate batches from the two workers.
```python
dl = DataLoader(RandDataset(seed=42), num_workers=2)
DATA2_rep = collect_data(dl)
# consecutive batches come from the two workers, which share the same seed
torch.all(DATA2_rep[0] == DATA2_rep[1])  # result: tensor(True)
```
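The duplication makes sense to me: each worker process receives its own copy of the dataset object, RNG state included, so identically seeded generators emit identical streams. A minimal NumPy illustration, independent of `DataLoader`:

```python
import numpy as np

# Two generators constructed with the same seed produce the same stream --
# which is exactly what happens when each worker copies the same dataset.
a = np.random.default_rng(42)
b = np.random.default_rng(42)
print([a.normal() for _ in range(3)] == [b.normal() for _ in range(3)])  # True
```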
My current workaround is to seed each worker differently based on its worker id. I ended up with something like:
```python
def seed_worker(worker_id):
    worker_info = torch.utils.data.get_worker_info()
    dataset = worker_info.dataset
    worker_id = worker_info.id
    dataset.data_rng = np.random.default_rng(dataset.seed + 1000 * worker_id)

DATA2_norep = collect_data(
    DataLoader(RandDataset(seed=42), num_workers=2, batch_size=32,
               worker_init_fn=seed_worker)
)
```
The sequence of generated batches `DATA2_norep` seems reproducible on my machine, and it contains no repeated instances. However, by design the result depends on the number of workers.
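To see the dependence on the worker count, here is a pure-NumPy model of what I believe happens (this assumes the `DataLoader` consumes batches from the workers round-robin, and reuses the `seed + 1000 * worker_id` scheme above; `interleaved_stream` is a hypothetical helper, not part of my code):

```python
import numpy as np

def interleaved_stream(seed, num_workers, n):
    # One generator per worker, seeded as in seed_worker above;
    # the loader is assumed to pull from the workers round-robin.
    rngs = [np.random.default_rng(seed + 1000 * w) for w in range(num_workers)]
    return [rngs[i % num_workers].normal() for i in range(n)]

# Each stream is reproducible, but the 1-worker and 2-worker streams differ.
print(interleaved_stream(42, 1, 4) == interleaved_stream(42, 2, 4))  # False
```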
My preferred solution would be a single shared source of randomness accessed sequentially by the workers, so that I would get the same sequence with 1 or more workers. Is that possible?
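For reference, the shape of what I have in mind is a single producer that owns the generator and feeds consumers through a queue (only a sketch with hypothetical names `rng_producer`/`draw`; sketched with a thread for simplicity, whereas with `DataLoader` worker processes the queue would presumably have to be a `multiprocessing.Queue`):

```python
import queue
import threading
import numpy as np

def rng_producer(seed, n, out_q):
    # Only this function ever touches the RNG, so the stream depends on
    # the seed alone, not on how many consumers drain the queue.
    rng = np.random.default_rng(seed)
    for _ in range(n):
        out_q.put(rng.normal())

def draw(seed, n):
    out_q = queue.Queue()
    producer = threading.Thread(target=rng_producer, args=(seed, n, out_q))
    producer.start()
    vals = [out_q.get() for _ in range(n)]
    producer.join()
    return vals
```

What I don't see is how to wire something like this into the `DataLoader` workers without serializing the whole pipeline, since the producer is inherently sequential.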