Large memory copy/leak when using dataloaders with dictionaries and workers > 0

penguinshin · January 15, 2019, 1:35am

Hello, I made a custom dataset that gets all its examples from CPU-intensive operations on a single dictionary of lists of dictionaries (that does not need to be modified). To access a single example, the dataset has to access an item in the list of one of the dictionaries. When I feed this dataset into a dataloader with workers > 0, the memory usage increases by 10X which crashes my computer. This is not the case when I keep workers = 0, but I need the speedup provided by multiprocessing. My guess is that for some reason, the dictionary (which is about 15GB) is being copied over to various worker processes, but since my dictionary is read-only I would think that it could be shared. Does this sound like the reason? Is this possible to do? And how would you achieve it?

dashesy · January 15, 2019, 2:27am

Processes cannot share memory the way threads do, and workers are processes. You can, for example, use mmap to share an in-memory file and keep your dictionary there.
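A minimal sketch of the mmap idea, using only the standard library. It assumes the data can be written out as one JSON document per line. The names `write_records` and `MmapRecords` are made up for illustration and are not part of any library. The file is mapped lazily, so each worker process would open its own read-only mapping and the operating system can share the underlying pages between them.

```python
import json
import mmap
import os
import tempfile


def write_records(path, records):
    """Write one JSON document per line and return the byte offsets.

    offsets[i] is where record i starts; offsets[-1] is the end of the file.
    """
    offsets = [0]
    with open(path, "wb") as f:
        for rec in records:
            line = json.dumps(rec).encode("utf-8") + b"\n"
            f.write(line)
            offsets.append(offsets[-1] + len(line))
    return offsets


class MmapRecords:
    """Hypothetical read-only, index-addressable view over the JSON-lines file."""

    def __init__(self, path, offsets):
        self.path = path
        self.offsets = offsets
        self._mm = None  # mapped on first access, i.e. inside each worker

    def _map(self):
        if self._mm is None:
            with open(self.path, "rb") as f:
                # Length 0 maps the whole file; ACCESS_READ makes it read-only.
                self._mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        return self._mm

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        mm = self._map()
        start, end = self.offsets[i], self.offsets[i + 1]
        # Only the bytes of this one record are decoded into Python objects.
        return json.loads(mm[start:end])
```

A dataset could then hold an `MmapRecords` instance instead of the full dictionary and decode a single record per `__getitem__` call. The offsets list is still an ordinary Python list, but it is small compared with the data itself.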

penguinshin · January 15, 2019, 3:53am

Ok, I will try that. I noticed that the memory usage doesn’t go up instantly (although it does go up very quickly). Basically, it doesn’t go up fast enough to indicate that it’s copying over the entire dictionary to each worker. That makes me wonder if it’s only copying over the relevant first-level key, i.e. if my overall dictionary has 20 keys and the data for a particular example lies in key 14, then only the values corresponding to key 14 will be copied over. Does this sound right?

dashesy · January 15, 2019, 9:09pm

There is no copying of data (at least not of the entire data). When a process forks, the parent’s memory is mapped into the new process as copy-on-write (COW) virtual memory. As soon as the new process writes to a page, there is a page fault that actually allocates the physical memory (the copy). I guess that when you change the dictionary, the entire dictionary will be page-faulted, so you will see memory jump as soon as the new processes change the dictionary. Instead of a dictionary, you can try multiple global variables.
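One detail that may matter here, offered as an assumption about CPython rather than something established in this thread: CPython keeps each object's reference count inside the object itself, so even purely read-only access updates that counter and writes to the page the object lives on. A minimal sketch (the names `record` and `alias` are made up for illustration):

```python
import sys

record = {"tokens": [1, 2, 3]}

before = sys.getrefcount(record)
alias = record  # a plain reference; the dict's contents are not modified
after = sys.getrefcount(record)

# The count stored in the object's header changed even though the
# dictionary itself was only read.
print(before, after)
```

If this applies, a forked worker that only reads from a large dictionary would gradually pull the pages it touches into its own private memory, which would look like a steady climb rather than an instant jump.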

penguinshin · January 15, 2019, 9:26pm

So, I’m not modifying the dictionary at all during this process, so is there still writing going on? I’m just using the dictionary as a data source, but the processing I’m doing does not modify it at all (to the best of my knowledge).

dashesy · January 16, 2019, 1:24am

It could be this GC issue as described by Instagram. It looks like there is gc.freeze() in Python 3.7 that can be used before fork to benefit from COW semantics.
You can try disabling GC and see if that works around the problem. Disabling GC has other issues though, especially with cyclic references.
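A minimal sketch of the `gc.freeze()` pattern, following the order suggested in the Python `gc` documentation: `gc.disable()` early in the parent, `gc.freeze()` right before the fork, `gc.enable()` early in the child. It assumes Python 3.7+ and a platform where the `fork` start method is available (so not Windows). The functions `load_data`, `worker` and `run` are hypothetical names for illustration.

```python
import gc
import multiprocessing as mp


def load_data():
    # Stand-in for the large read-only dictionary of lists of dictionaries.
    return {"split_%d" % k: [{"id": i} for i in range(1000)] for k in range(20)}


def worker(data, key, index, out):
    gc.enable()  # re-enable collection early in the child
    out.put(data[key][index]["id"])


def run():
    gc.disable()  # avoid collections while the data is being built
    data = load_data()
    gc.freeze()  # move all tracked objects to the permanent generation

    ctx = mp.get_context("fork")  # children inherit `data` without pickling
    out = ctx.Queue()
    proc = ctx.Process(target=worker, args=(data, "split_14", 7, out))
    proc.start()
    result = out.get(timeout=30)
    proc.join()

    gc.enable()
    return result
```

With a DataLoader, the same calls would presumably go right before the workers are started, with `gc.enable()` placed in a `worker_init_fn`; that placement is an assumption, not something tested in this thread. Note that freezing only keeps the garbage collector from touching those objects. It does not stop reference-count updates.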

I think it is safer/easier to use mmap, or numpy.memmap if it is an array.
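A minimal sketch of the numpy.memmap variant, assuming the examples can be stored as a fixed-shape numeric array. `build_memmap` and `MemmapDataset` are hypothetical names for illustration. The array is written to disk once, and each worker opens it read-only on first access instead of inheriting a large in-memory Python object.

```python
import os
import tempfile

import numpy as np


def build_memmap(path, array):
    """Write `array` to `path` as a raw memory-mapped file."""
    mm = np.memmap(path, dtype=array.dtype, mode="w+", shape=array.shape)
    mm[:] = array
    mm.flush()
    del mm  # release the writable mapping


class MemmapDataset:
    """Index-addressable, read-only view over the file on disk."""

    def __init__(self, path, dtype, shape):
        self.path = path
        self.dtype = dtype
        self.shape = shape
        self._data = None  # opened lazily, i.e. inside each worker

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, i):
        if self._data is None:
            self._data = np.memmap(
                self.path, dtype=self.dtype, mode="r", shape=self.shape
            )
        # Copy just this row into an ordinary in-memory array.
        return np.array(self._data[i])
```

A class like this has the `__len__` and `__getitem__` methods that a map-style dataset needs, so it could be adapted into a `torch.utils.data.Dataset` subclass. Since a raw memmap file stores no dtype or shape information, both have to be kept alongside the path.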

penguinshin · January 19, 2019, 8:45pm

Ok thanks Dashesy, very interesting. I tried gc.disable() and it doesn’t fix the problem. I think there is some problem with handling my object since it’s arbitrary JSON. I think I will go for the numpy memmap array approach.

Liang · March 6, 2020, 8:37pm

I have a possible solution for a similar problem. I also store a large variable in the dataloader object, and I came across the same memory problem. There seems to be a bug in PyTorch: if you set num_workers < the number of physical CPU cores, it works fine; otherwise, PyTorch seems to replicate the dataloader object in memory, which leads to a memory leak.

In short, there is a BUG in PyTorch; try setting num_workers < the number of physical CPU cores.