Reuse of DataLoader worker processes and caching in DataLoader

In one training loop I am reading about 100,000 feature files, each around 400 KB in size. This is slow when the files are stored on an HDD, so I added the @lru_cache decorator to the function that reads the feature files. However, it seems that new worker processes are created for each iteration over the DataLoader (if num_workers > 0), so the data cached in the old worker processes is useless in the next iteration.

Is there a way to reuse the created worker processes, or is there a caching option in the DataLoader? What is the best practice for reading a huge number of small files with PyTorch?
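Roughly, this is what I'm doing (a simplified sketch; the paths and the loading function are placeholders):

```python
import functools

import torch
from torch.utils.data import Dataset


@functools.lru_cache(maxsize=None)  # cache size is a placeholder
def load_features(path):
    # Stand-in for the real reader of a ~400 KB feature file.
    return torch.load(path)


class FeatureDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # The lru_cache lives inside the calling process, so with
        # num_workers > 0 every worker fills its own cache, and that
        # cache is thrown away when the workers are recreated.
        return load_features(self.paths[index])
```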


Have a look at this small example of how to use shared arrays across multiple workers and let me know if that works for you.
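In short, the idea is something like this (a simplified sketch, not the exact code from the repository; the random data stands in for your real loading step):

```python
import ctypes
import multiprocessing as mp

import numpy as np
import torch
from torch.utils.data import Dataset


class SharedCacheDataset(Dataset):
    """Caches all samples in an array that is shared between workers."""

    def __init__(self, num_samples, feature_dim):
        self.num_samples = num_samples
        self.feature_dim = feature_dim
        # Process-safe buffer; with the default fork start method all
        # DataLoader workers see the same underlying memory.
        shared_base = mp.Array(ctypes.c_float, num_samples * feature_dim)
        shared_np = np.ctypeslib.as_array(shared_base.get_obj())
        self.shared_array = torch.from_numpy(
            shared_np.reshape(num_samples, feature_dim))
        self.use_cache = False

    def set_use_cache(self, use_cache):
        self.use_cache = use_cache

    def __getitem__(self, index):
        if not self.use_cache:
            # First epoch: "load" the sample (random data stands in for the
            # real disk read) and write it into the shared cache.
            x = torch.randn(self.feature_dim)
            self.shared_array[index] = x
        # Afterwards the sample is served directly from the shared cache.
        return self.shared_array[index]

    def __len__(self):
        return self.num_samples
```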


The idea of using the multiprocessing module to share objects between worker processes indeed works, thank you.

Since the feature files I use have a huge total size and cannot be identified simply by index, I used a modified pylru.lrucache object, registered it in multiprocessing.managers.BaseManager, and shared the cached content between processes. I’ll post the example later.
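Roughly, it looks something like this (a simplified sketch; the names, cache size, and the get/put wrapper are placeholders, and the real version modifies pylru.lrucache itself):

```python
import multiprocessing.managers

import pylru
import torch
from torch.utils.data import Dataset


class FeatureCache:
    """Thin wrapper around pylru.lrucache with plain methods,
    so it can be proxied by a manager without exposing dunder methods."""

    def __init__(self, size):
        self._cache = pylru.lrucache(size)

    def get(self, key):
        # Return None on a miss instead of raising KeyError.
        if key in self._cache:
            return self._cache[key]
        return None

    def put(self, key, value):
        self._cache[key] = value


class CacheManager(multiprocessing.managers.BaseManager):
    pass


CacheManager.register('FeatureCache', FeatureCache)


class CachedFeatureDataset(Dataset):
    def __init__(self, paths, cache):
        self.paths = paths
        self.cache = cache  # proxy; all calls run in the manager process

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        path = self.paths[index]
        features = self.cache.get(path)
        if features is None:
            features = torch.load(path)  # slow HDD read, done at most once per key
            self.cache.put(path, features)
        return features


if __name__ == '__main__':
    manager = CacheManager()
    manager.start()
    shared_cache = manager.FeatureCache(10000)   # keep up to 10000 files cached
    paths = ['features/0.pt', 'features/1.pt']   # placeholder file list
    dataset = CachedFeatureDataset(paths, shared_cache)
```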


I’m not sure how to use your snippet. There are two lines I have questions about.

Shouldn’t the use_cache boolean be reversed (i.e., initialized with True instead)?

And loader.dataset.set_use_cache(True): how are you executing it after looping over the data?

I don’t think so. If you set it to False, the cache will be populated with your tensors.
Passing True will instead read the data from the internal “cache”.
I should probably change the if-statement for better readability. :wink:

After the first epoch, I call loader.dataset.set_use_cache(True) so that the following epochs just read the data from the already cached tensors.
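For example (building on the shared-array sketch above; the sizes are placeholders):

```python
from torch.utils.data import DataLoader

# SharedCacheDataset is the sketch from the earlier post; sizes are arbitrary.
dataset = SharedCacheDataset(num_samples=100, feature_dim=10)
loader = DataLoader(dataset, batch_size=10, num_workers=2)

for epoch in range(3):
    for batch in loader:
        pass  # training step would go here
    if epoch == 0:
        # From the second epoch on, serve the data from the shared cache
        # instead of "loading" it again.
        loader.dataset.set_use_cache(True)
```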

Yeah, thank you, that’s much better. By the way, how can I also store the length of x in the same shared_array[index]? Or, to generalize the question, how can I store multiple values?

A more general version might be to use a shared dict. Otherwise you would need to somehow pack the additional information into your array.
Would that work for you?
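Something along these lines (a rough sketch, with random data standing in for the real loading):

```python
import multiprocessing as mp

import torch
from torch.utils.data import Dataset


class SharedDictDataset(Dataset):
    """Caches several values per index (here a tensor and its length)
    in a dict that is shared between workers via a manager process."""

    def __init__(self, num_samples):
        self.num_samples = num_samples
        # Keep a reference to the manager so its server process stays alive.
        self._manager = mp.Manager()
        self.shared_dict = self._manager.dict()

    def __getitem__(self, index):
        if index not in self.shared_dict:
            # Stand-in for the real (slow) loading from disk.
            length = int(torch.randint(5, 15, (1,)))
            x = torch.randn(length)
            # Store multiple values for this index in a single entry.
            self.shared_dict[index] = (x, length)
        return self.shared_dict[index]

    def __len__(self):
        return self.num_samples
```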


Why do we first create an mp.Array and then convert it to a numpy array? Can’t we directly create a numpy array and build a torch.Tensor from it?

This use case is more or less an edge case and we are creating the multiprocessing.Array so that we can share it between multiple processes.

If you don’t need this functionality, you are fine using standard numpy arrays and converting them to tensors via torch.from_numpy.
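For example (the shapes are arbitrary):

```python
import ctypes
import multiprocessing as mp

import numpy as np
import torch

# Shared buffer that all worker processes can read and write.
shared_base = mp.Array(ctypes.c_float, 10 * 5)
# View the same memory as a numpy array and then as a tensor (no copies).
shared_np = np.ctypeslib.as_array(shared_base.get_obj()).reshape(10, 5)
shared_tensor = torch.from_numpy(shared_np)

# Without the sharing requirement, a plain numpy array is all you need.
plain_tensor = torch.from_numpy(np.zeros((10, 5), dtype=np.float32))
```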

Thanks for the quick reply!
Is there any advantage of using multiprocessing.Array over a standard numpy array and simply letting multiple workers access the data at the index they need?

The initial post describes this special use case.

You would basically load the data once and reuse it for all workers.
That’s not the standard workflow, where you usually load e.g. image files and apply live data augmentation to them, so I would recommend not using the shared array unless you really need it.


Hi there, are you able to share some example code? I’m in a similar situation where I want to share the lru_cache across workers. Thanks!

Thank you for the pytorch_misc repository.

1. For the read-write case:

I just wonder what happens if we replace multiprocessing.Manager.dict with a plain Python dict in the snippet.

2. For the read-only case:

If I can properly preload all data into a container (without using a Dataset or DataLoader) and then use that container as a read-only cache for the dataset for the rest of execution, is it still necessary to use multiprocessing.Manager.dict, or can we switch to a normal dict?

  1. I guess sharing the same dict won’t work, as it would be copied. Did you try to replace it and run the code snippet?

  2. If you can preload the data, you could use it directly in a Dataset and the DataLoader would then create copies of it for each worker (see the sketch below).
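For example (torch.load stands in for your own loading code):

```python
import torch
from torch.utils.data import Dataset


class PreloadedDataset(Dataset):
    """All data is loaded once up front and only read afterwards."""

    def __init__(self, paths):
        # Placeholder loading step; replace with your own feature reader.
        self.data = [torch.load(p) for p in paths]

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

# Each worker gets its own copy of the dataset (and thus of self.data),
# so purely read-only access needs no Manager or shared memory, at the
# cost of the additional memory used by the copies.
```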

  1. I tried replacing the manager dict with a standard Python dict. Nothing weird happened in the snippet.
  2. I printed the object id in __getitem__ and observed only a single id in both cases, so I’m guessing the dict is not duplicated in my case.

My fear is that it works fine in a small snippet but breaks in more complicated circumstances.

I have long been wondering about the mechanics of the PyTorch DataLoader and Dataset, but I’m still confused after searching for information.

According to the GitHub links above, the fork/spawn start methods might be the key to the details, but this is difficult for us as users.

I hope an in-depth tutorial on the Dataset/DataLoader mechanism can be written. That would be helpful for users with slightly more advanced needs.