What features in DataPipes replace features in DataLoader?

The torchdata package has been very cool to work through.
Looking at GitHub - pytorch/data: A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries, I am having a hard time understanding what specifically DataPipes are replacing in the DataLoader.

  • Who should have batching?

In the example readme we have:

import torchdata.datapipes as dp
datapipes1 = dp.iter.FileOpener(['a.file', 'b.file']).map(fn=decoder).shuffle().batch(2)
…I’m wondering if the purpose of the DataLoader when datapipes are used is to just serve as a multiprocessing shell

datapipes2 = dp.iter.Batcher(datapipes1, 2)

So is the intention that “batching” be defined by the datapipe as opposed to the DataLoader?

  • Forking/multiprocessing is handled?
    I noticed the usage of a “Fork” function in DataPipes. This gives me the impression that it could be combined with multiprocessing/threading?

Is the intention that we can customize how multiprocessing is handled?

  • shuffling
    Is the intention that we no longer need to pass any shuffling functions to the DataLoader? Is this going to be moved to a DataPipe?

Are there other features in DataLoader that are being/planned to be replaced by DataPipelines? What features are intended to remain in DataLoader? I’m imagining primarily/only multiprocessing?


We are currently working on a new version of DataLoader; more details will come out in the coming months. We created an RFC a year ago to solicit feedback.

At a high level, the plan is that DataLoader V2 will only be responsible for multiprocessing, distributed loading, and similar functionality, not data processing logic. All data processing features, such as shuffling and batching, will be moved out of DataLoader into DataPipes. At the same time, the current/old version of DataLoader will still be available, and you can use DataPipes with that as well.

Having a DataPipe for batching allows users to define precisely when batching happens, and it can also prevent partial batches from being dropped. Another type of batching, bucketbatch, is also available.
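The idea can be illustrated with a plain-Python sketch (this is not torchdata's implementation, just the concept): a batcher sits wherever you place it in the pipeline, and an explicit `drop_last` flag decides the fate of the final partial batch.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batch(source: Iterable[T], batch_size: int, drop_last: bool = False) -> Iterator[List[T]]:
    """Group a stream into lists of batch_size, mimicking a batching DataPipe."""
    buf: List[T] = []
    for item in source:
        buf.append(item)
        if len(buf) == batch_size:
            yield buf
            buf = []
    # Because batching is explicit, keeping or dropping the partial batch is up to the user.
    if buf and not drop_last:
        yield buf

# The last partial batch is kept by default:
print(list(batch(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Since this is just another stage in the pipeline, you choose whether batching happens before or after, say, shuffling, which is exactly the flexibility the DataPipe approach gives you.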

Similarly, the shuffler allows you to define when shuffling should take place. It also uses a buffer, which allows you to shuffle datasets that may not fit into memory.
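A rough plain-Python sketch of the buffered-shuffle idea (again, an illustration of the concept, not torchdata's actual code): a fixed-size buffer is filled from the stream, and random elements are emitted as new ones arrive, so the whole dataset never needs to be in memory at once.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def buffered_shuffle(source: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream using a bounded buffer."""
    rng = random.Random(seed)
    buf = []
    for item in source:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Emit a random buffered element; the next incoming item refills the buffer.
            yield buf.pop(rng.randrange(len(buf)))
    # Drain whatever remains once the stream ends.
    rng.shuffle(buf)
    yield from buf

out = list(buffered_shuffle(range(10), buffer_size=4))
print(out)  # a permutation of 0..9
```

A larger buffer gives a more thorough shuffle at the cost of memory; a buffer as large as the dataset degenerates to a full in-memory shuffle.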

The fork DataPipe is unrelated to multiprocessing; it just allows you to separate one DataPipe into two DataPipes, which is necessary for some use cases. A similar DataPipe, demux, is basically fork with a filter.
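The distinction can be sketched with stdlib tools (a conceptual analogy, not torchdata's implementation): `fork` gives every element to both branches, while `demux` routes each element to exactly one branch based on a predicate.

```python
import itertools
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")

def fork(source: Iterable[T]) -> Tuple[Iterator[T], Iterator[T]]:
    """Fork-like behavior: both branches see every element (tee buffers as needed)."""
    return itertools.tee(source, 2)

def demux(source: Iterable[T], predicate: Callable[[T], bool]):
    """Demux-like behavior: each element goes to exactly one branch, chosen by the predicate."""
    a, b = itertools.tee(source, 2)
    return filter(predicate, a), itertools.filterfalse(predicate, b)

evens, odds = demux(range(6), lambda x: x % 2 == 0)
print(list(evens), list(odds))  # [0, 2, 4] [1, 3, 5]
```

Splitting a labeled stream into train/validation branches, or routing records by file type to different decoders, are the kinds of use cases where fork and demux come in.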

This is extremely cool and answers my question well. The RFC link is also very cool / kind of fun to read. I’ll follow that as it progresses. For my project I’ll experiment with a custom dataloader to see how datapipes could simplify it.