I was tracking down a memory leak that kept showing up when two models are running on the same computer, each training on a different GPU (with 2 workers each for the data loaders). That’s when I learned that I should only be using numpy arrays or PyTorch tensors for all instance variables inside the workers, instead of Python objects. After I rewrote all of that code, the leak is much smaller, but still there. BUT only when two models are running on the same machine. When one model is running, there is no leak. I suspect it has to do with multiprocessing, because there is no leak with two models if I set num_workers to 0.
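For anyone hitting the same thing, here is a minimal sketch of the "numpy arrays instead of Python objects" change. The usual explanation, as far as I understand it, is that reading Python lists in forked workers touches reference counts, so pages that started out shared get copied and RAM appears to grow. The class name `PackedSamples` and its fields are hypothetical, not from my actual code:

```python
import numpy as np


class PackedSamples:
    """Dataset-like container that keeps its metadata in NumPy arrays.

    Hypothetical example: `paths` and `labels` stand in for whatever
    per-sample metadata a real dataset holds.
    """

    def __init__(self, paths, labels):
        # A fixed-width unicode array is one contiguous buffer, so there is
        # no separate Python str object per sample.
        self.paths = np.array(paths, dtype=np.str_)
        # Same idea for labels: one int64 buffer instead of a list of ints.
        self.labels = np.asarray(labels, dtype=np.int64)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Convert back to plain Python types only for the requested item.
        return str(self.paths[idx]), int(self.labels[idx])
```

The same `__len__` / `__getitem__` pair can go on a `torch.utils.data.Dataset` subclass; the only point here is how the instance variables are stored.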
To clarify, each model is trained using the same code, independently (i.e., running one script per model).
1 model: no leak (+0GB after 10+ days)
2 models: ~ +2GB/day of RAM usage
I don’t have a clue how to chase this down, because there is both a leak and no leak. Any suggestions?
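One way to narrow this down is to log the resident memory of each process (main script and each worker) separately, to see which one is actually growing. A rough sketch, assuming Linux with `/proc` available; the helper names `parse_vmrss_kb` and `rss_kb` are made up for illustration:

```python
def parse_vmrss_kb(status_text):
    """Return VmRSS in kB from the text of /proc/<pid>/status, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:    123456 kB".
            return int(line.split()[1])
    return None


def rss_kb(pid):
    """Read the resident set size (kB) of a running process. Linux only."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kb(f.read())
```

Calling `rss_kb(pid)` every few minutes for the training process and its worker PIDs, and writing the numbers to a log, should show whether the +2GB/day sits in the main process, in the workers, or in neither (which would point at shared or pinned memory rather than ordinary process memory).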
So, I’m testing setting pin_memory=False on both my train and evaluation dataloaders. The first thing that happened is that within an epoch it complained about running out of shared memory for the workers (nothing else changed). And right now the RAM usage doesn’t seem to be growing. Will check back in 12hrs to see if that remains true.
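Since the workers hand batches to the main process through shared memory, it may help to check how much of it is actually available when that error shows up. A small sketch, assuming Linux where shared memory is backed by a `/dev/shm` mount; the function name `shm_usage_gb` is hypothetical:

```python
import shutil


def shm_usage_gb(path="/dev/shm"):
    """Return total/used/free space in GB for the filesystem at `path`.

    Assumes shared memory is mounted at /dev/shm, which is common on
    Linux but not guaranteed (containers often make it very small).
    """
    total, used, free = shutil.disk_usage(path)
    gb = 1024 ** 3
    return {"total": total / gb, "used": used / gb, "free": free / gb}
```

Printing `shm_usage_gb()` while both jobs are running would show whether the two jobs together are close to the limit of the shared-memory mount.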
So far it looks like that’s the conclusion: pinning memory in two jobs on the same machine leads to leaks. I’ve found some vague mentions of over-allocating CUDA pinned memory, but no hard and fast rules. With one job there should be somewhere between 0.4GB and 1.6GB pinned at any one time (I’m assuming the pinning queue allows any ready workers to start pinning), which means at most 3.2GB for two jobs. On a machine with 96GB (and at least 15GB free), that seems way too small to be hitting some allocation limit.
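The back-of-the-envelope estimate above, written out. All numbers are the ones quoted in this post; the function name `pinned_budget` is just for illustration, and the per-job range itself rests on the assumption about how the pinning queue behaves:

```python
def pinned_budget(per_job_max_gb, jobs, total_ram_gb, free_ram_gb):
    """Worst-case pinned memory across jobs, and its share of total and free RAM."""
    worst_case_gb = per_job_max_gb * jobs
    return {
        "worst_case_gb": worst_case_gb,
        "pct_of_total": round(100 * worst_case_gb / total_ram_gb, 1),
        "pct_of_free": round(100 * worst_case_gb / free_ram_gb, 1),
    }


# 1.6GB upper bound per job, 2 jobs, 96GB machine, at least 15GB free.
estimate = pinned_budget(per_job_max_gb=1.6, jobs=2, total_ram_gb=96, free_ram_gb=15)
print(estimate)
```

Even in the worst case the pinned memory is a few percent of the machine's RAM and well under the free amount, which is why an allocation limit looks like an unlikely explanation.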