Guidance on understanding the DataLoader architecture at a minute level

Vinayaka_Hegde · January 18, 2023, 4:31pm

Hi,

I’m currently trying to understand the DataLoader architecture of PyTorch as part of my research. It’s very important for me to gain a minute-level understanding of a few things, such as:

how the sampler yields indices to the dataloader
how the _MultiProcessingDataLoaderIter class works internally
how the _index_queues for every worker are populated
Why exactly is _task_info needed when we already have _index_queues and _data_queue to store the indices to be processed and the final data, respectively? Is it there to store out-of-order indices from the sampler?
Why do we need the _rcvd_idx and _send_idx attributes? Are these just counters? Does _send_idx - _rcvd_idx give the number of elements currently being processed? What exactly are they used for?
Why do we call _try_put_index again in _process_data, given that we have already sent prefetch_factor * num_workers index batches to the workers during initialization?
After a prefetched batch has been consumed, how is the next batch prefetched?
Is there any reason for the _data_queue to store (send_idx, data)?
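To make these questions concrete, here is my current mental model as a toy, single-process simulation. It is only a sketch under my own assumptions and not the PyTorch source: the variable names mirror the attributes I asked about, `simulate` and `worker_step` are hypothetical helpers, workers are assigned round-robin, "loading" just multiplies each index by 10, and the highest-numbered busy worker always finishes first so that results arrive out of order.

```python
from collections import deque


def simulate(batches, num_workers=2, prefetch_factor=2):
    """Toy model of the prefetch bookkeeping (an assumption, not the real code)."""
    sampler_iter = iter(batches)                       # stands in for the batch sampler
    index_queues = [deque() for _ in range(num_workers)]
    task_info = {}    # task id -> (worker_id,) or (worker_id, data) once stashed
    send_idx = 0      # id of the next task to hand to a worker
    rcvd_idx = 0      # id of the next task to return to the caller
    next_worker = 0
    arrival_order = []

    def try_put_index():
        nonlocal send_idx, next_worker
        try:
            batch = next(sampler_iter)
        except StopIteration:
            return                                     # sampler exhausted
        index_queues[next_worker].append((send_idx, batch))
        task_info[send_idx] = (next_worker,)
        next_worker = (next_worker + 1) % num_workers
        send_idx += 1

    def worker_step():
        # Stand-in for the data queue: returns one finished (task_id, data) pair.
        for worker_id in reversed(range(num_workers)):
            if index_queues[worker_id]:
                task_id, batch = index_queues[worker_id].popleft()
                arrival_order.append(task_id)
                return task_id, [i * 10 for i in batch]
        raise RuntimeError("no pending work")

    # Prime the pipeline with prefetch_factor * num_workers tasks.
    for _ in range(prefetch_factor * num_workers):
        try_put_index()

    delivered, max_in_flight = [], 0
    while rcvd_idx < send_idx:
        max_in_flight = max(max_in_flight, send_idx - rcvd_idx)
        if len(task_info[rcvd_idx]) == 2:
            # The expected task arrived earlier and was stashed.
            data = task_info.pop(rcvd_idx)[1]
        else:
            task_id, data = worker_step()
            if task_id != rcvd_idx:
                task_info[task_id] += (data,)          # arrived early: stash it
                continue
            del task_info[task_id]
        delivered.append(data)
        rcvd_idx += 1
        try_put_index()                                # refill: one out, one in
    return delivered, arrival_order, max_in_flight


batches = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
delivered, arrival_order, max_in_flight = simulate(batches)
print(arrival_order)   # order in which tasks finished
print(delivered)       # order in which batches reach the caller
print(max_in_flight)   # largest value of send_idx - rcvd_idx
```

In this toy model the task id travels with the data so that early arrivals can be stashed and returned in sampler order, and the refill call after each delivered batch keeps the number of outstanding tasks at prefetch_factor * num_workers. Whether the real implementation works for the same reasons is exactly what I would like to confirm.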

I also have many more such detailed questions about the DataLoader architecture and what happens internally.

I’ve searched extensively for material on how the DataLoader architecture is implemented, but couldn’t find any resources. I have also gone through the source code for the dataloader, fetch, sampler, dataset, worker and other files.

Could you please point me to any resources or contacts that would help me understand the architecture of the DataLoader?

Would you suggest doing a dry run with a dummy input and trying to make sense of the values at every step in the code?
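The kind of dry run I have in mind would look roughly like the sketch below. `IndexEchoDataset` and `dry_run` are hypothetical names of my own; the only PyTorch pieces used are the public `Dataset`, `DataLoader` and `get_worker_info`. Each sample records its index and the id of the worker that loaded it (or -1 when loading happens in the main process).

```python
import torch
from torch.utils.data import DataLoader, Dataset, get_worker_info


class IndexEchoDataset(Dataset):
    """Hypothetical dummy dataset: each item is (index, id of the loading worker)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        info = get_worker_info()          # None when called in the main process
        worker_id = -1 if info is None else info.id
        return torch.tensor([index, worker_id])


def dry_run(num_workers):
    loader = DataLoader(
        IndexEchoDataset(8),
        batch_size=2,
        shuffle=False,
        num_workers=num_workers,
    )
    trace = []
    for batch in loader:
        # Each batch has shape (2, 2): one row per sample, columns (index, worker_id).
        trace.append(batch.tolist())
    return trace
```

Calling `dry_run(num_workers=0)` exercises the single-process path. To observe the multiprocessing iterator I would call `dry_run(num_workers=2)` from inside an `if __name__ == "__main__":` guard and step through it with a debugger or print statements placed in the source.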

Thanks!