Very consistent memory leak


#1

Hi Everybody,
I am seeing a very consistent memory leak when training a model with PyTorch. Every epoch I lose 108k +/- 6k pages of CPU memory. I tried with numworker = 0 and 4, and with and without a GPU; in all cases I lose about the same amount of memory each epoch. After about 160 epochs my training is either killed by the queuing system for exceeding the requested memory (if nworker=0) or crashes with an error message like the one below or a different memory error (nworker>0):
File "/gstore/apps/Anaconda3/5.0.1/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/gstore/apps/Anaconda3/5.0.1/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/gstore/apps/Anaconda3/5.0.1/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/gstore/apps/Anaconda3/5.0.1/lib/python3.6/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/gstore/apps/Anaconda3/5.0.1/lib/python3.6/multiprocessing/popen_fork.py", line 67, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

I have counted the number and type of python objects in each epoch following this suggestion:
https://tech.labs.oliverwyman.com/blog/2008/11/14/tracing-python-memory-leaks/
and found that after about 5 epochs the number of Python objects does not change at all. So my conclusion is that the leak might be in the torch C code.
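
For reference, a minimal sketch of this kind of per-epoch object count, using only the standard `gc` module. The helper names `count_objects_by_type` and `diff_counts` are made up for illustration and are not taken from the linked post. Note that `gc.get_objects()` only sees objects tracked by the Python garbage collector, so memory held on the C side would not show up here.

```python
import gc
from collections import Counter


def count_objects_by_type():
    """Return a Counter mapping type name -> number of gc-tracked objects."""
    gc.collect()  # drop unreachable objects first so they are not counted
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def diff_counts(before, after):
    """Return only the type names whose count changed, with the change."""
    names = set(before) | set(after)
    return {
        name: after[name] - before[name]
        for name in names
        if after[name] != before[name]
    }
```

Calling `count_objects_by_type()` at the end of each epoch and passing two consecutive snapshots to `diff_counts` shows which types keep growing. An empty result, as described above, suggests the growth is not in Python-level objects.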

Any ideas how I could further debug this?
Thanks,
Alberto

Some more information in case this helps:

  • My training involves a variable number of records per minibatch
  • My training set size is 114989 “objects” in minibatches of 1500
  • The network is simple but somewhat unconventional: each “object” results in multiple (varying 0-20) records passed through multiple fully connected layers. The results are then “averaged” before computing the final score and the loss function
  • I tried adding torch.backends.cudnn.enabled = False with no effect.
  • I am using PyTorch 0.4.0 on Linux

vmSize change from one epoch to the next (note the last dip is due to the system crashing):
[Plot: vmSize per epoch]
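
A rough sketch of how a per-epoch vmSize reading could be logged on Linux. It assumes the value is read from the `VmSize` field of `/proc/self/status`; the post does not say how the numbers above were collected, and the function names here are hypothetical.

```python
import os


def parse_status_kb(status_text, field="VmSize"):
    """Extract a field such as VmSize (in kB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            # Lines look like "VmSize:    123456 kB"
            return int(line.split()[1])
    return None


def current_vm_pages(status_path="/proc/self/status"):
    """Return the current virtual memory size in pages (Linux only)."""
    with open(status_path) as f:
        size_kb = parse_status_kb(f.read())
    if size_kb is None:
        return None
    page_kb = os.sysconf("SC_PAGE_SIZE") // 1024
    return size_kb // page_kb
```

Logging `current_vm_pages()` once per epoch and printing the difference from the previous epoch gives a growth figure comparable to the ~108k pages reported above.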


#2

Is your model on the GPU or CPU? If switching to the other one doesn’t OOM then there’s probably a memory leak somewhere.

In general my advice is to try to narrow down where the memory leak is by deleting code until the memory stops leaking.


#3

Hi @richard
thanks for looking at this. The memory leak happens on both GPU and CPU.
Additionally, I tried with numworker=0 and 4; in all 4 combinations I am getting similar memory leaks.
A colleague suggested switching off eval mode. Aside from that, I will have to think about what else can be switched off.
Alberto


#4

I tried some more things:

  • disabled eval mode
  • removed a non-standard regularization step
  • replaced the non-standard activation function with an out-of-the-box ELU

In all cases I see the same consistent memory leak of ~108k pages of memory per epoch.

Any additional ideas on how to analyze this would be welcome.
Alberto


#5

I solved the memory leak. I found the solution in the FAQ:
https://pytorch.org/docs/stable/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory
Note: this issue affects both GPU and CPU memory.

Alberto
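
The post does not say which entry of that FAQ section applied. As an illustration only, here is a sketch of the first pattern the section describes: accumulating a loss tensor across the training loop, which keeps autograd history alive. The model, data, and the function name `train_epoch` are hypothetical stand-ins, not the original training code.

```python
import torch
from torch import nn

# Hypothetical toy model and data standing in for the real training setup.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)


def train_epoch(num_batches=5):
    """Run a few training steps and return the mean loss as a Python float."""
    running_loss = 0.0
    for _ in range(num_batches):
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        # Leaky variant: `running_loss += loss` would make running_loss a
        # tensor that carries autograd history from every iteration.
        # Converting to a Python number drops that history.
        running_loss += loss.item()
    return running_loss / num_batches
```

The same idea applies to anything stored for logging, such as lists of per-batch losses or predictions: store `loss.item()` or a detached copy instead of the tensor itself.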