In one training loop, I am reading about 100000 feature files, each about 400 KB in size. This is slow when the files are stored on an HDD, so I added the @lru_cache decorator to the function that reads the feature files. However, it seems that new worker processes are created for each epoch of the DataLoader (if num_workers > 0), so the data cached in the old worker processes is useless in the next epoch.
Is there a way to reuse the created worker processes, or is there a caching option in the DataLoader? What is the best practice for reading a huge number of small files with PyTorch?
Have a look at this small example on how to use shared arrays for multiple workers and let me know if that works for you.
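A minimal sketch of the shared-array idea, assuming a fixed number of same-sized float32 features; `load_feature_file` is a placeholder for the real disk read, not code from the thread. The cache lives in a `multiprocessing.Array`, so every DataLoader worker sees the same buffer:

```python
# Sketch: cache loaded samples in a multiprocessing.Array so all DataLoader
# workers read from the same shared memory instead of re-reading from disk.
import ctypes
import multiprocessing as mp

import numpy as np
import torch
from torch.utils.data import Dataset


class SharedCacheDataset(Dataset):
    def __init__(self, num_samples, feature_dim):
        self.num_samples = num_samples
        self.feature_dim = feature_dim
        # One flat float32 buffer, shared between all worker processes.
        shared_base = mp.Array(ctypes.c_float, num_samples * feature_dim)
        shared_np = np.ctypeslib.as_array(shared_base.get_obj())
        self.shared_array = shared_np.reshape(num_samples, feature_dim)
        self.use_cache = False

    def set_use_cache(self, use_cache):
        self.use_cache = use_cache

    def load_feature_file(self, index):
        # Placeholder for the expensive read from the HDD.
        return np.full(self.feature_dim, index + 1.0, dtype=np.float32)

    def __getitem__(self, index):
        if not self.use_cache:
            # First epoch: read from disk and fill the shared cache.
            self.shared_array[index] = self.load_feature_file(index)
        # Afterwards: serve the sample directly from the shared buffer.
        return torch.from_numpy(self.shared_array[index])

    def __len__(self):
        return self.num_samples
```

After the first full epoch you would call `dataset.set_use_cache(True)` so subsequent epochs skip the disk entirely.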
The idea of using the multiprocessing module to share objects between worker processes indeed works, thank you. Since the feature files I use have a huge total size and cannot be identified simply by index, I used a modified pylru.lrucache object, registered it in multiprocessing.managers.BaseManager, and shared the cached content between processes. I’ll post the example later.
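A rough sketch of the BaseManager approach described above, with a hand-rolled LRU cache standing in for the modified pylru.lrucache (the class name, `capacity` parameter, and `get`/`put` methods are my own choices, not the poster's code). The cache lives in the manager's server process, and each worker talks to it through a proxy:

```python
# Sketch: register an LRU cache in a BaseManager so one cache instance,
# living in the manager process, is shared by all DataLoader workers.
from collections import OrderedDict
from multiprocessing.managers import BaseManager


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


class CacheManager(BaseManager):
    pass


# Workers receive an auto-generated proxy exposing get/put.
CacheManager.register("LRUCache", LRUCache)
```

Usage would be `manager = CacheManager(); manager.start(); cache = manager.LRUCache(1000)`, then pass `cache` into the Dataset and call `cache.get`/`cache.put` inside `__getitem__`. Note that every access goes through IPC, so this trades raw speed for sharing.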
I’m not sure how to use your snippet. Two lines look questionable to me:
1. Should the use_cache boolean value be reversed (i.e., initialized with True instead)?
2. For loader.dataset.set_use_cache(True), where do you execute it after the data loop?
I don’t think so. If you set it to False, the cache will be populated with your tensors, while passing True will instead get the data from the internal “cache”. Probably I should change the if-statement for better readability.
After the first epoch, I call loader.dataset.set_use_cache(True) so that the following epochs just get the data from the already cached tensors.
Yeah, thank you, that’s much better. By the way, how can I also store the x length in the same shared_array[index], or, to generalize the question, how can I store multiple values?
A more general version might be the usage of a shared dict. Otherwise you would need to somehow pack the additional information into your array.
Would that work for you?
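One way the shared-dict suggestion could look, storing the features and the x length together as a tuple per index (the helper names `put_sample`/`get_sample` are illustrative, not from the thread):

```python
# Sketch: a manager-backed dict holding (features, length) tuples, so
# multiple values per index can be cached and seen by all workers.
import multiprocessing as mp

import torch

manager = mp.Manager()
shared_cache = manager.dict()  # visible to all worker processes


def put_sample(index, features, length):
    # Store everything belonging to one index under a single key.
    shared_cache[index] = (features, length)


def get_sample(index):
    features, length = shared_cache[index]
    return features, length


put_sample(0, torch.ones(5), 5)
feats, x_len = get_sample(0)
```

Every access is proxied through the manager process, so this is slower per item than a raw shared array, but it handles heterogeneous values without packing tricks.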
Why do we first create an mp.Array and then convert it to a numpy array? Can’t we directly create a numpy array and create a torch.Tensor from it?
This use case is more or less an edge case, and we are creating the multiprocessing.Array so that we can share it between multiple processes. If you don’t need this functionality, you are fine using standard numpy arrays and converting them to tensors via torch.from_numpy.
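For the single-process case, the conversion is a one-liner; note that `torch.from_numpy` shares memory with the source array rather than copying it:

```python
# A plain numpy array converts to a tensor without copying.
import numpy as np
import torch

arr = np.zeros((4, 3), dtype=np.float32)
t = torch.from_numpy(arr)  # `t` shares memory with `arr`
arr[0, 0] = 1.0            # the tensor sees this change, since no copy was made
```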
Thanks for the quick reply!
Is there any advantage of using multiprocessing.Array over a standard numpy array and letting multiple workers access the data at the index they are trying to access?
The initial post describes this special use case.
You would basically load the data once and reuse it for all workers.
That’s not the standard workflow, where you usually load e.g. image files and apply some live data augmentation on them, so I would recommend not using the shared array unless you really need it.
Hi there, are you able to share some example code? I’m in a similar situation where I want to share the lru_cache across workers. Thanks!
Thank you for the pytorch_misc repository.
1. For the read-write purpose: I just wonder what will happen if we replace the manager.dict with a common Python dict in the snippet.
2. For the read-only purpose: if I can preload all data into a container in a proper way (not using a Dataset or DataLoader), and then use that container as a cache for the dataset, where the data will be read-only for the rest of the execution, is it still necessary to use the manager.dict, or can we turn to a normal dict?
- I tried to replace the manager.dict with a standard Python dict. Nothing weird happens in the snippet.
- I printed the object id in __getitem__ and observed the same id in both cases, so I guess the dict is not duplicated in my case.
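A sketch of that id-probing check, with one caveat worth stating: under the fork start method each worker gets a copy-on-write duplicate of the dataset, so `id()` (a virtual address in CPython) can look identical across processes even though the dicts are separate copies once written to. Printing the pid alongside the id makes this visible:

```python
# Sketch: probe whether workers share the dict or hold private copies.
# With fork-start workers, id() alone can be misleading, because copy-on-write
# duplicates keep the same virtual address; compare the pid as well.
import os

import torch
from torch.utils.data import Dataset, DataLoader


class IdProbeDataset(Dataset):
    def __init__(self):
        self.cache = {}

    def __getitem__(self, i):
        print(f"pid={os.getpid()} cache_id={id(self.cache)}")
        self.cache[i] = i  # writes land in the current process's copy only
        return i

    def __len__(self):
        return 4


if __name__ == "__main__":
    loader = DataLoader(IdProbeDataset(), num_workers=2)
    for _ in loader:
        pass
```

A more reliable test than comparing ids is to write into the dict from a worker and check afterwards whether the main process's copy saw the write; with a plain dict it will not, which is exactly why manager.dict is needed in the read-write case.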
The fear is that it works fine in a small snippet but breaks under more complicated circumstances.
I have long been wondering about the mechanism of the PyTorch DataLoader and Dataset, but I still get confused after searching for information.
According to the GitHub links above, the fork/spawn start methods might be the key to understanding the details, but this is difficult for us as users.
I hope an in-depth tutorial on the Dataset/DataLoader mechanism could be written; that would be helpful for users with slightly more advanced needs.