Dataloader and memmaps

Hello everyone,
I am currently running into some problems, and I wonder whether they are caused by the interaction between the DataLoader and numpy memmaps.
When I run the DataLoader with num_workers=0 I get no errors, but with num_workers=1 I suddenly do. Disk usage is very high and it looks like I am running out of RAM.
I wonder if num_workers=1 (or larger) somehow loads the numpy memmap fully into memory instead of using it as intended.
To be clear: I use the DataLoader to access data from one giant memmap (~32 GB).
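For context, the setup is presumably something along these lines. This is only a minimal sketch: the file name, dtype, shapes, and class name are placeholders, and the single memmap stands in for whatever self.list_of_arrays actually holds in the real code.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MemmapDataset(Dataset):
    def __init__(self, path, num_samples, sample_shape):
        # memmap handle created once, in the parent process
        self.data = np.memmap(path, dtype=np.float32, mode="r",
                              shape=(num_samples, *sample_shape))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # copy only this sample's slice into regular memory
        return torch.from_numpy(np.array(self.data[idx]))

# works with num_workers=0, misbehaves with num_workers>=1
loader = DataLoader(MemmapDataset("data.bin", 100_000, (1024,)),
                    batch_size=64, num_workers=1)
```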

ChatGPT's answer:

The behavior you’re experiencing with the dataloader and numpy memmaps when changing the num_workers parameter could indeed be related to memory usage and how data is loaded. Let’s explore some possible explanations and solutions:

  1. Memory consumption: When you increase the number of workers (num_workers > 0), each worker will have its own memory space to load and process data. If each worker is loading the entire 32GB numpy memmap into memory, the total memory usage can quickly accumulate and potentially exceed the available RAM. This could lead to out-of-memory errors or high disk usage due to swapping.
  2. Shared memory: By default, when using multiple workers, the dataloader uses inter-process communication to share the data between processes. This means that the numpy memmap could be accessed by each worker independently, resulting in redundant memory usage. This behavior can be controlled by setting multiprocessing_context to ‘fork’ or ‘spawn’ in the dataloader constructor to avoid redundant memory usage.
  3. DataLoader implementation: The dataloader itself doesn’t directly control how the data is loaded from the memmap. It relies on the dataset object to handle the data loading. It’s possible that the implementation of your custom dataset is causing the entire memmap to be loaded into memory for each worker.

To address these issues, consider the following solutions:

  1. Chunked loading: Instead of loading the entire memmap into memory, you can implement a mechanism to load data in smaller chunks or batches. This way, each worker will only load a fraction of the data, reducing memory consumption.
  2. Modify the dataset: Review the implementation of your custom dataset and ensure that it loads only the necessary data from the memmap. You can use indexing or slicing operations to extract the required data for each sample.
  3. Memory mapping options: Check the options you used when creating the numpy memmap. You can control the mode (e.g., ‘r’ for read-only) and the shape of the memory map to match your specific needs. Adjusting these options can help optimize memory usage.
  4. Experiment with num_workers: If reducing the number of workers (num_workers) resolves the issue, it suggests that the memory usage is related to the number of processes loading the data. You can try finding a balance between performance and memory usage by experimenting with different values for num_workers.

By implementing these suggestions, you should be able to mitigate memory-related issues and optimize the interaction between the dataloader and numpy memmaps.

This is not really helpful.
As stated above:
The entire point of a numpy memmap is that it does not get loaded into memory, and I already verified this for num_workers=0.
Also, “spawn” is the only available start method on Windows.

The issue could arise because each worker loads a batch of data in parallel with the others, so the available memory might not be enough to hold more than one batch at a time.

This is also not the case. Splitting the memmap into separate files worked. The actual solution (if you don’t want to split the memmap files) was moving the creation of self.list_of_arrays into __getitem__. I am not entirely sure why, but it fixed the problem.
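A minimal sketch of what that fix could look like, assuming the dataset wraps a single large memmap (names, dtype, and shapes are placeholders): the np.memmap handle is created lazily inside __getitem__, i.e. inside each worker process, instead of in __init__, so the handle is never pickled and sent to the spawned workers.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class LazyMemmapDataset(Dataset):
    def __init__(self, path, num_samples, sample_shape):
        self.path = path
        self.num_samples = num_samples
        self.sample_shape = sample_shape
        self.data = None  # opened on first access, once per process

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if self.data is None:
            # each worker opens its own read-only view of the file
            self.data = np.memmap(self.path, dtype=np.float32, mode="r",
                                  shape=(self.num_samples, *self.sample_shape))
        # copy only this sample's slice into regular memory
        return torch.from_numpy(np.array(self.data[idx]))

loader = DataLoader(LazyMemmapDataset("data.bin", 100_000, (1024,)),
                    batch_size=64, num_workers=2)
```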


Could you share how you solved the problem? Are you using np.memmap in __getitem__?


Can you please share the solution?