I have a few questions. First, the scenario: let's say I have a bunch of files that I want to cycle through infinitely, while also shuffling and sharding them for distributed + multiprocessing reading services.
Is this the proper flow?
```python
dp = dp.cycle()
dp = dp.shuffle()
dp = dp.sharding_filter()
```
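For reference, here is a toy stdlib-only sketch of the cycle → shuffle → shard semantics I have in mind (a buffered streaming shuffle plus round-robin sharding; the function names are my own, not torchdata's):

```python
import itertools
import random

def cycle(items):
    # Repeat the underlying sequence forever.
    while True:
        yield from items

def buffered_shuffle(stream, buffer_size, rng):
    # Keep a fixed-size buffer and emit a random element each step,
    # approximating a streaming shuffle over an infinite source.
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))

def shard(stream, num_shards, shard_id):
    # Round-robin sharding: each worker keeps every num_shards-th item.
    return (x for i, x in enumerate(stream) if i % num_shards == shard_id)

rng = random.Random(0)
pipe = shard(buffered_shuffle(cycle(range(8)), buffer_size=4, rng=rng),
             num_shards=2, shard_id=0)
print(list(itertools.islice(pipe, 6)))
```

This is only meant to pin down what I expect each stage to do, not how torchdata implements it internally.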
Regarding the iterator of DataLoader2 and setting the seed, is this flow correct?
```python
dist_rs = DistributedReadingService()
mp_rs = MultiProcessingReadingService(num_workers=num_workers)
rs = SequentialReadingService(dist_rs, mp_rs)

dl = DataLoader2(dp, reading_service=rs)
dl_iter = iter(dl)

for epoch in range(10):
    dl.seed(epoch)
    for i in range(100):
        batch = next(dl_iter)
```
Finally, does calling `dl.seed(epoch)` cause the datapipe to be re-shuffled and re-sharded? And does it recreate all the multiprocessing workers, or are the workers persistent?
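To make the re-shuffling part of the question concrete, this is the behaviour I am asking about, sketched in plain Python (my own toy, not torchdata code): re-seeding before each epoch should give a different but reproducible order.

```python
import random

data = list(range(10))

def epoch_order(seed):
    # Re-seed, then shuffle a copy: same seed -> same order,
    # different seed -> (almost certainly) different order.
    rng = random.Random(seed)
    items = data.copy()
    rng.shuffle(items)
    return items

# Different epochs get different orders...
assert epoch_order(0) != epoch_order(1)
# ...while each epoch's order is reproducible across runs.
assert epoch_order(0) == epoch_order(0)
```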
Thanks in advance!