Data loader multiprocessing architecture

I wanted to dig into the internal architecture of the data loader, so I started with the source code and tried to understand dataloader.py. I now understand the dataset types and how the sampler behaves for each of them. I also understand, at a high level, how multiprocessing data loading works: how the worker processes are created, how indices are placed on the index queues, and how the workers send data to the shared data queue. But I want a more in-depth understanding of the multiprocessing data-fetching part. My questions are:
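To make the Dataset/Sampler split mentioned above concrete, here is a pure-Python sketch (my own toy classes, not PyTorch's actual code): the sampler decides *which* indices to load and in what order, while a map-style dataset only answers `__getitem__` for a single index.

```python
import random

class SquaresDataset:
    """Toy map-style dataset: knows its length and how to fetch one item."""
    def __len__(self):
        return 5
    def __getitem__(self, i):
        return i * i

class ShuffleSampler:
    """Toy sampler: yields all indices of the dataset in a shuffled order."""
    def __init__(self, n, seed=0):
        self.n, self.seed = n, seed
    def __iter__(self):
        order = list(range(self.n))
        random.Random(self.seed).shuffle(order)
        return iter(order)

ds = SquaresDataset()
order = list(ShuffleSampler(len(ds)))      # sampler chooses the index order
samples = [ds[i] for i in order]           # dataset is only asked per-index
```

In the real DataLoader the same division of labor holds; the loader (or its workers) simply iterates the sampler's indices and calls `dataset[i]`.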

  1. How do the worker processes consume from the index queue? Are the indices consumed in sequential order, or in random order?
  2. How do the worker processes write into the data queue? Do they write sequentially or in parallel?
  3. How does the data loader fetch and output items from the data queue? How are the mini-batches constructed?

Mainly, I want a detailed picture of the multiprocessing data-loading part: the workers' interaction with the data queue and the index queues, their scheduling, and the order in which data moves between the workers and the queues. And how does the data loader construct and output the mini-batches, given that all the workers execute in parallel?
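As a way of framing these questions, here is a simplified toy model I wrote of the mechanism being asked about (it is *not* PyTorch's actual implementation; `simple_loader` and `worker_loop` are made-up names). The pattern: each worker owns its own index queue and consumes tasks from it sequentially; the main process round-robins `(batch_id, indices)` tasks across workers; all workers push results into one shared result queue in parallel; and a reorder buffer in the main process yields batches strictly in `batch_id` order even when results arrive out of order.

```python
import multiprocessing as mp

def worker_loop(dataset, index_queue, result_queue):
    # Each worker consumes (batch_id, indices) tasks from ITS OWN index
    # queue sequentially, fetches the items, and puts the result on the
    # SHARED result queue (so writes from different workers interleave).
    while True:
        task = index_queue.get()
        if task is None:  # sentinel: shut down
            break
        batch_id, indices = task
        result_queue.put((batch_id, [dataset[i] for i in indices]))

def simple_loader(dataset, batch_size=2, num_workers=2):
    ctx = mp.get_context("fork")  # POSIX-only shortcut for this sketch
    result_queue = ctx.Queue()
    index_queues, workers = [], []
    for _ in range(num_workers):
        iq = ctx.Queue()
        p = ctx.Process(target=worker_loop, args=(dataset, iq, result_queue))
        p.start()
        index_queues.append(iq)
        workers.append(p)

    # Sequential "sampler" order here; batches are assigned to workers
    # round-robin, so each worker sees its share of indices in order.
    batches = [list(range(i, min(i + batch_size, len(dataset))))
               for i in range(0, len(dataset), batch_size)]
    for batch_id, indices in enumerate(batches):
        index_queues[batch_id % num_workers].put((batch_id, indices))

    # Reorder buffer: workers run in parallel, so results may arrive out
    # of order, but batches are yielded strictly in batch_id order.
    buffer, next_id = {}, 0
    for _ in range(len(batches)):
        bid, batch = result_queue.get()
        buffer[bid] = batch
        while next_id in buffer:
            yield buffer.pop(next_id)
            next_id += 1

    for iq in index_queues:
        iq.put(None)  # one sentinel per worker
    for p in workers:
        p.join()

if __name__ == "__main__":
    data = [x * 10 for x in range(6)]
    for batch in simple_loader(data, batch_size=2, num_workers=2):
        print(batch)  # batches come out in order despite parallel workers
```

If I have understood the source correctly, this reorder buffer is the answer to "how is order preserved when workers run in parallel", but I would appreciate confirmation of the details.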

I couldn't find any videos or blogs explaining the DataLoader multiprocessing architecture in depth. Please point me to some resources, or feel free to explain it here.


Please don't tag specific people, as it might discourage others from answering, and the tagged users might not know all the answers directly. In particular, I cannot give detailed answers and would refer you to these docs, in case you haven't seen them already (I guess you have).
