Hello everyone,
I am currently running into some problems and I wonder if they are caused by the interaction between the dataloader and numpy memmaps.
When I run the dataloader with num_workers=0 I get no errors. If I run it with num_workers=1 I suddenly get errors: disk usage is very high and it looks like I am running out of RAM.
I wonder if num_workers=1 (or larger) actually loads the numpy memmap into memory for some reason, instead of using it as intended.
To be clear: I use the dataloader to access data from one giant memmap (~32GB).
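Roughly, the setup looks something like this (a simplified sketch, not my actual code; the file name, shape, and class name are placeholders):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MemmapDataset(Dataset):
    def __init__(self, path, shape, dtype=np.float32):
        # The memmap is opened read-only here; only the pages that are
        # actually touched should ever be read from disk.
        self.data = np.memmap(path, dtype=dtype, mode="r", shape=shape)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # Copy one row out of the memmap and hand it to PyTorch.
        return torch.from_numpy(np.array(self.data[idx]))

# Placeholder shape: 8_000_000 x 1024 float32 is roughly 32 GB.
dataset = MemmapDataset("big_array.dat", shape=(8_000_000, 1024))

# Works fine with num_workers=0, memory blows up with num_workers >= 1.
loader = DataLoader(dataset, batch_size=64, num_workers=1)
```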
ChatGPT answer:
The behavior you're experiencing with the dataloader and numpy memmaps when changing the num_workers parameter could indeed be related to memory usage and how data is loaded. Let's explore some possible explanations and solutions:
- Memory consumption: When you increase the number of workers (num_workers > 0), each worker will have its own memory space to load and process data. If each worker is loading the entire 32GB numpy memmap into memory, the total memory usage can quickly accumulate and potentially exceed the available RAM. This could lead to out-of-memory errors or high disk usage due to swapping.
- Shared memory: By default, when using multiple workers, the dataloader uses inter-process communication to share the data between processes. This means that the numpy memmap could be accessed by each worker independently, resulting in redundant memory usage. This behavior can be controlled by setting multiprocessing_context to "fork" or "spawn" in the dataloader constructor to avoid redundant memory usage.
- DataLoader implementation: The dataloader itself doesn't directly control how the data is loaded from the memmap. It relies on the dataset object to handle the data loading. It's possible that the implementation of your custom dataset is causing the entire memmap to be loaded into memory for each worker.
To address these issues, consider the following solutions:
- Chunked loading: Instead of loading the entire memmap into memory, you can implement a mechanism to load data in smaller chunks or batches. This way, each worker will only load a fraction of the data, reducing memory consumption.
- Modify the dataset: Review the implementation of your custom dataset and ensure that it loads only the necessary data from the memmap. You can use indexing or slicing operations to extract the required data for each sample.
- Memory mapping options: Check the options you used when creating the numpy memmap. You can control the mode (e.g., "r" for read-only) and the shape of the memory map to match your specific needs. Adjusting these options can help optimize memory usage.
- Experiment with num_workers: If reducing the number of workers (num_workers) resolves the issue, it suggests that the memory usage is related to the number of processes loading the data. You can try finding a balance between performance and memory usage by experimenting with different values for num_workers.
By implementing these suggestions, you should be able to mitigate memory-related issues and optimize the interaction between the dataloader and numpy memmaps.
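To be concrete, what the answer is suggesting (read-only mode, loading only chunks of the data) boils down to roughly this; the file name, dtype and shape are placeholders:

```python
import numpy as np

# Open the file read-only; nothing is read from disk until the array is indexed.
data = np.memmap("big_array.dat", dtype=np.float32, mode="r",
                 shape=(8_000_000, 1024))

# Copy only a small chunk of rows into RAM instead of the whole array.
chunk = np.array(data[0:4096])
```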
This is not really helpful.
As stated above:
The entire point of the numpy memmap file is that it does not get loaded into memory. I already checked this for num_workers=0.
Also, "spawn" is the only available option on Windows.
The issue could arise from each worker loading a batch of data in parallel with the other workers; as a result, the available memory might not be enough to hold more than one batch at a time.
This is also not the case. Splitting the memmap into separate files did work, but the actual solution (when you don't want to split the memmap files) was moving self.list_of_arrays into the get function (__getitem__). I am not totally sure why, but it fixed the problem.
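For anyone hitting the same thing, the change amounted to roughly the following (a sketch, not my exact code; the paths, shapes, and indexing scheme are placeholders): the memmaps are no longer opened in __init__ but lazily inside __getitem__, so each worker process opens its own handles after it has been started.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class LazyMemmapDataset(Dataset):
    def __init__(self, paths, rows_per_file, feature_dim):
        # Only remember how to open the memmaps; do not open them here,
        # because __init__ runs before the worker processes exist.
        self.paths = paths
        self.rows_per_file = rows_per_file
        self.feature_dim = feature_dim
        self.list_of_arrays = None

    def __len__(self):
        return len(self.paths) * self.rows_per_file

    def __getitem__(self, idx):
        if self.list_of_arrays is None:
            # Opened lazily on first access, so every worker process
            # ends up with its own memmap handles.
            self.list_of_arrays = [
                np.memmap(p, dtype=np.float32, mode="r",
                          shape=(self.rows_per_file, self.feature_dim))
                for p in self.paths
            ]
        file_idx, row = divmod(idx, self.rows_per_file)
        return torch.from_numpy(np.array(self.list_of_arrays[file_idx][row]))
```

My guess is that with "spawn" the dataset object gets pickled into every worker process, and a memmap that was already created in __init__ gets serialized (and therefore read) along with it, whereas a lazily opened one does not, but I have not verified this.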