I’m currently trying to understand the DataLoader architecture of PyTorch as part of my research. It’s very important for me to gain a minute-level understanding of a few things, such as:
- How does the sampler yield indices to the DataLoader?
- How does the _MultiProcessingDataLoaderIter class work internally?
- How are the _index_queues for each worker populated?
- Why exactly is _task_info used when we already have the _index_queues and _data_queue to hold the indices to be processed and the final data, respectively? Is it there to buffer batches that come back from the workers out of order?
- Why do we need the _rcvd_idx and _send_idx fields? Are these just counters? Does _send_idx - _rcvd_idx give the number of batches currently in flight? What exactly are they used for?
- Why is _try_put_index called again in _process_data, when we have already dispatched prefetch_factor * num_workers indices to the workers during initialization?
- After prefetching a batch, how is the next prefetched batch loaded?
- Is there any reason for the _data_queue to store (send_idx, data) tuples?
And multiple such minute questions about the architecture of dataloader and what’s happening internally…
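For context, here is the toy model I have built up of the _send_idx / _rcvd_idx / _task_info bookkeeping while reading the source. This is only my own simplified sketch in plain Python — the randomly finishing "workers" and the `simulate` function are my inventions, not the real API — but it reproduces the in-order delivery behaviour that I suspect _task_info exists for. Please correct me if this mental model is wrong:

```python
import random
from collections import deque

def simulate(num_batches=8, num_workers=2, prefetch_factor=2):
    """Toy model of _MultiProcessingDataLoaderIter's bookkeeping.

    Workers are simulated as a pool of in-flight tasks that can finish
    out of order; task_info buffers early arrivals so batches are still
    yielded to the caller in sampler order.
    """
    sampler = iter(range(num_batches))  # stand-in for the (batch) sampler
    task_info = {}                      # send index -> None (pending) or data
    in_flight = deque()                 # send indices handed to "workers"
    send_idx = 0                        # next index to dispatch
    rcvd_idx = 0                        # next index to hand back to the caller
    out = []

    def try_put_index():
        nonlocal send_idx
        idx = next(sampler, None)
        if idx is None:
            return                      # sampler exhausted
        task_info[send_idx] = None      # task dispatched, no data yet
        in_flight.append(send_idx)
        send_idx += 1

    # Prime the pipeline with prefetch_factor * num_workers tasks,
    # as the iterator's __init__ does.
    for _ in range(prefetch_factor * num_workers):
        try_put_index()

    while rcvd_idx < num_batches:
        # Simulate some worker finishing a (possibly out-of-order) task.
        done = in_flight[random.randrange(len(in_flight))]
        in_flight.remove(done)
        task_info[done] = f"batch-{done}"   # like (idx, data) on the data queue

        # Drain everything that is now ready, strictly in order.
        while rcvd_idx in task_info and task_info[rcvd_idx] is not None:
            out.append(task_info.pop(rcvd_idx))
            rcvd_idx += 1
            try_put_index()             # keep the prefetch depth topped up

    return out

print(simulate())                       # batches come out in sampler order
```

If this model is right, then _task_info is what lets batches that finish early wait until all earlier send indices have been returned, and re-calling _try_put_index after each delivered batch is what holds the number of outstanding tasks at prefetch_factor * num_workers — which is part of what I’m hoping someone can confirm.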
I’ve searched a lot for resources on the DataLoader’s internal architecture but couldn’t find any, and I have already gone through the source code for dataloader.py, fetch.py, sampler.py, dataset.py, worker.py, and the other related files.
I sincerely request you to point me to any resources or contacts that would help me understand the DataLoader’s architecture.
Would you suggest doing a dry run with a dummy input and trying to make sense of the values at every step in the code?
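If a dry run is the way to go, this is the kind of torch-free stand-in I was planning to start from before stepping through the real classes with a debugger. The `ToyDataset`, `batch_sampler`, and `fetch` names are mine, but they mimic what I believe `Dataset.__getitem__`, `BatchSampler`, and the map-style fetcher do in the single-process path:

```python
class ToyDataset:
    """Map-style dataset stand-in: indexable, with a length."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, i):
        return self.data[i]

    def __len__(self):
        return len(self.data)

def batch_sampler(n, batch_size):
    """Yield lists of indices, like BatchSampler over a sequential sampler."""
    batch = []
    for i in range(n):              # a sequential sampler yields 0..n-1
        batch.append(i)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                       # final, possibly smaller, batch
        yield batch

def fetch(dataset, indices):
    """Index the dataset per index, then 'collate' (here: just a list)."""
    return [dataset[i] for i in indices]

ds = ToyDataset([x * 10 for x in range(7)])
for indices in batch_sampler(len(ds), batch_size=3):
    print(indices, "->", fetch(ds, indices))
```

My plan was to verify my understanding on a toy like this first, then set num_workers=0 on a real DataLoader and step through it, and only then tackle _MultiProcessingDataLoaderIter — does that order make sense?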