I have a few questions. First, the scenario: let's say I have a bunch of files that I want to cycle through infinitely, while also shuffling and sharding them for distributed + multiprocessing reading services.
Is this the proper flow?
```python
dp = dp.cycle()
dp = dp.shuffle()
dp = dp.sharding_filter()
```
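For reference, here is a toy stdlib-only sketch of the cycle → shuffle → shard semantics I have in mind (a buffered streaming shuffle plus round-robin sharding; the function names are my own, not torchdata's):

```python
import itertools
import random

def cycle(items):
    # Repeat the underlying sequence forever.
    while True:
        yield from items

def buffered_shuffle(stream, buffer_size, rng):
    # Keep a fixed-size buffer and emit a random element each step,
    # approximating a streaming shuffle over an infinite source.
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))

def shard(stream, num_shards, shard_id):
    # Round-robin sharding: each worker keeps every num_shards-th item.
    return (x for i, x in enumerate(stream) if i % num_shards == shard_id)

rng = random.Random(0)
pipe = shard(buffered_shuffle(cycle(range(8)), buffer_size=4, rng=rng),
             num_shards=2, shard_id=0)
print(list(itertools.islice(pipe, 6)))
```

This is only meant to pin down what I expect each stage to do, not how torchdata implements it internally.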
Regarding the iterator of DataLoader2 and setting the seed, is this flow correct?
```python
dist_rs = DistributedReadingService()
mp_rs = MultiProcessingReadingService(num_workers=num_workers)
rs = SequentialReadingService(dist_rs, mp_rs)

dl = DataLoader2(dp, reading_service=rs)
dl_iter = iter(dl)

for epoch in range(10):
    dl.seed(epoch)
    for i in range(100):
        batch = next(dl_iter)
```
Finally, does calling `dl.seed(epoch)` cause the datapipe to be re-shuffled and re-sharded? And does it recreate all the multiprocessing workers, or are the workers persistent?
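To make the re-shuffling part of the question concrete, this is the behaviour I am asking about, sketched in plain Python (my own toy, not torchdata code): re-seeding before each epoch should give a different but reproducible order.

```python
import random

data = list(range(10))

def epoch_order(seed):
    # Re-seed, then shuffle a copy: same seed -> same order,
    # different seed -> (almost certainly) different order.
    rng = random.Random(seed)
    items = data.copy()
    rng.shuffle(items)
    return items

# Different epochs get different orders...
assert epoch_order(0) != epoch_order(1)
# ...while each epoch's order is reproducible across runs.
assert epoch_order(0) == epoch_order(0)
```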
Thanks in advance!