In one training loop, I am reading about 100000 feature files, each about 400 KB in size. This is slow when the files are stored on an HDD, so I added the @lru_cache decorator to the function that reads the feature files. However, it seems that new worker processes are created for each epoch of the DataLoader (if num_workers > 0), so the data cached in the old worker processes is useless in the next epoch.
Is there a way to reuse the created worker processes, or is there a caching option in the DataLoader? What is the best practice for reading a huge number of small files with PyTorch?
Have a look at this small example on how to use shared arrays for multiple workers and let me know if that works for you.
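A minimal sketch of the shared-array idea, assuming a fixed number of same-sized float32 features; `load_feature_file` is a placeholder for the real disk read, not code from the thread. The cache lives in a `multiprocessing.Array`, so every DataLoader worker sees the same buffer:

```python
# Sketch: cache loaded samples in a multiprocessing.Array so all DataLoader
# workers read from the same shared memory instead of re-reading from disk.
import ctypes
import multiprocessing as mp

import numpy as np
import torch
from torch.utils.data import Dataset


class SharedCacheDataset(Dataset):
    def __init__(self, num_samples, feature_dim):
        self.num_samples = num_samples
        self.feature_dim = feature_dim
        # One flat float32 buffer, shared between all worker processes.
        shared_base = mp.Array(ctypes.c_float, num_samples * feature_dim)
        shared_np = np.ctypeslib.as_array(shared_base.get_obj())
        self.shared_array = shared_np.reshape(num_samples, feature_dim)
        self.use_cache = False

    def set_use_cache(self, use_cache):
        self.use_cache = use_cache

    def load_feature_file(self, index):
        # Placeholder for the expensive read from the HDD.
        return np.full(self.feature_dim, index + 1.0, dtype=np.float32)

    def __getitem__(self, index):
        if not self.use_cache:
            # First epoch: read from disk and fill the shared cache.
            self.shared_array[index] = self.load_feature_file(index)
        # Afterwards: serve the sample directly from the shared buffer.
        return torch.from_numpy(self.shared_array[index])

    def __len__(self):
        return self.num_samples
```

After the first full epoch you would call `dataset.set_use_cache(True)` so subsequent epochs skip the disk entirely.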
The idea of using the multiprocessing module to share objects between worker processes indeed works, thank you. Since the feature files I use have a huge total size and cannot be identified simply by index, I used a modified pylru.lrucache object, registered it in multiprocessing.managers.BaseManager, and shared the cached content between processes. I’ll post the example later.
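A rough sketch of the BaseManager approach described above, with a hand-rolled LRU cache standing in for the modified pylru.lrucache (the class name, `capacity` parameter, and `get`/`put` methods are my own choices, not the poster's code). The cache lives in the manager's server process, and each worker talks to it through a proxy:

```python
# Sketch: register an LRU cache in a BaseManager so one cache instance,
# living in the manager process, is shared by all DataLoader workers.
from collections import OrderedDict
from multiprocessing.managers import BaseManager


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


class CacheManager(BaseManager):
    pass


# Workers receive an auto-generated proxy exposing get/put.
CacheManager.register("LRUCache", LRUCache)
```

Usage would be `manager = CacheManager(); manager.start(); cache = manager.LRUCache(1000)`, then pass `cache` into the Dataset and call `cache.get`/`cache.put` inside `__getitem__`. Note that every access goes through IPC, so this trades raw speed for sharing.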
I’m not sure how to use your snippet. Two lines look questionable to me:
1. Should the use_cache boolean value be reversed (i.e., initialized with True instead)?
2. For loader.dataset.set_use_cache(True), where do you execute it after the data loop?
I don’t think so. If you set it to False, the cache will be populated with your tensors, while passing True will instead get the data from the internal “cache”. Probably I should change the if-statement for better readability.
After the first epoch, I call loader.dataset.set_use_cache(True) so that the following epochs just get the data from the already cached tensors.
Yeah, thank you, that’s much better. By the way, how can I also store the x length in the same shared_array[index], or, to generalize the question, how can I store multiple values?
A more general version might be the usage of a shared dict. Otherwise you would need to somehow pack the additional information into your array.
Would that work for you?
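One way the shared-dict suggestion could look, storing the features and the x length together as a tuple per index (the helper names `put_sample`/`get_sample` are illustrative, not from the thread):

```python
# Sketch: a manager-backed dict holding (features, length) tuples, so
# multiple values per index can be cached and seen by all workers.
import multiprocessing as mp

import torch

manager = mp.Manager()
shared_cache = manager.dict()  # visible to all worker processes


def put_sample(index, features, length):
    # Store everything belonging to one index under a single key.
    shared_cache[index] = (features, length)


def get_sample(index):
    features, length = shared_cache[index]
    return features, length


put_sample(0, torch.ones(5), 5)
feats, x_len = get_sample(0)
```

Every access is proxied through the manager process, so this is slower per item than a raw shared array, but it handles heterogeneous values without packing tricks.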
Why do we first create an mp.Array and then convert it to a numpy array? Can’t we directly create a numpy array and create a torch.Tensor from it?
This use case is more or less an edge case, and we are creating the multiprocessing.Array so that we can share it between multiple processes. If you don’t need this functionality, you are fine using standard numpy arrays and converting them to tensors via torch.from_numpy.
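For the single-process case, the conversion is a one-liner; note that `torch.from_numpy` shares memory with the source array rather than copying it:

```python
# A plain numpy array converts to a tensor without copying.
import numpy as np
import torch

arr = np.zeros((4, 3), dtype=np.float32)
t = torch.from_numpy(arr)  # `t` shares memory with `arr`
arr[0, 0] = 1.0            # the tensor sees this change, since no copy was made
```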
Thanks for the quick reply!
Is there any advantage of using multiprocessing.Array over a standard numpy array and letting multiple workers access the data at the index they are trying to access?
The initial post describes this special use case.
You would basically load the data once and reuse it for all workers.
That’s not the standard workflow, where you usually load e.g. image files and apply some live data augmentation on them, so I would recommend not using the shared array unless you really need it.
Hi there, are you able to share some example code? I’m in a similar situation where I want to share the lru_cache across workers. Thanks!
Thank you for the pytorch_misc repository.
1. For the read-write purpose: I just wonder what will happen if we replace the manager.dict with a common Python dict in the snippet.
2. For the read-only purpose: if I can preload all data into a container in a proper way (not using a Dataset or DataLoader), and then use that container as a cache for the dataset, where the data will be read-only for the rest of the execution, is it still necessary to use the manager.dict, or can we turn to a normal dict?
- I tried to replace the manager.dict with a standard Python dict. Nothing weird happens in the snippet.
- I printed the object id in __getitem__ and observed the same id in both cases, so I guess the dict is not duplicated in my case.
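A sketch of that id-probing check, with one caveat worth stating: under the fork start method each worker gets a copy-on-write duplicate of the dataset, so `id()` (a virtual address in CPython) can look identical across processes even though the dicts are separate copies once written to. Printing the pid alongside the id makes this visible:

```python
# Sketch: probe whether workers share the dict or hold private copies.
# With fork-start workers, id() alone can be misleading, because copy-on-write
# duplicates keep the same virtual address; compare the pid as well.
import os

import torch
from torch.utils.data import Dataset, DataLoader


class IdProbeDataset(Dataset):
    def __init__(self):
        self.cache = {}

    def __getitem__(self, i):
        print(f"pid={os.getpid()} cache_id={id(self.cache)}")
        self.cache[i] = i  # writes land in the current process's copy only
        return i

    def __len__(self):
        return 4


if __name__ == "__main__":
    loader = DataLoader(IdProbeDataset(), num_workers=2)
    for _ in loader:
        pass
```

A more reliable test than comparing ids is to write into the dict from a worker and check afterwards whether the main process's copy saw the write; with a plain dict it will not, which is exactly why manager.dict is needed in the read-write case.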
The fear is that it works fine in a small snippet but breaks under more complicated circumstances.
I have long been wondering about the mechanism of the PyTorch DataLoader and Dataset, but I still get confused after searching for information.
According to the GitHub links above, the fork/spawn start methods might be the key to understanding the details, but this is difficult for us as users.
I hope an in-depth tutorial on the Dataset/DataLoader mechanism could be written; that would be helpful for users with slightly more advanced needs.