Memory is not released in Jupyter after KeyboardInterrupt

Every time I manually interrupt training, some memory remains stuck. For example, interrupting this tutorial - http://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html - during the training phase leaves about 500 MB occupied. The memory is not connected to any objects, and deleting everything in the notebook’s scope doesn’t release it. Interrupting training a second time adds the same amount of leaked memory to the “pool”. Restarting the kernel helps :)
Is there any way to release the memory or reset the graph similar to tf.reset_default_graph()?
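For reference, here is roughly what I tried before restarting (a minimal sketch, assuming a CUDA setup; the Linear module just stands in for whatever is left in the notebook's scope, and torch.cuda.empty_cache() is only there if your PyTorch version provides it):

import gc
import torch
import torch.nn as nn

# Stand-in for whatever objects are still referenced in the notebook.
model = nn.Linear(10, 10)
if torch.cuda.is_available():
    model = model.cuda()

# Drop the Python references that keep the tensors alive.
del model

gc.collect()                   # collect unreferenced objects and reference cycles
if torch.cuda.is_available():
    torch.cuda.empty_cache()   # hand cached, unused GPU blocks back to the driver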

Yes, I face the same problem with Jupyter. Restarting the kernel also does not do the trick. In fact, Jupyter starts a bunch of processes with sequential IDs, and they won’t even show up in your nvidia-smi output.

The only way I’ve found to locate them is to use

sudo fuser -v /dev/nvidia*

Then kill the processes by PID.
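For example (12345 is just a placeholder for whatever PID fuser printed):

sudo kill -9 12345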

OR just kill all ipython kernel stuff using

pkill -f ipykernel

I tested it on the CPU, so maybe that’s why the restart did help in my case.

Also, repeatedly deep-copying the model into the same variable leads to growing memory consumption.
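Roughly this pattern, as a toy sketch (the sizes and the loop are made up, not the real training code):

import copy
import torch.nn as nn

model = nn.Linear(1000, 1000)
best_model = None
for epoch in range(10):
    # Each iteration creates a fresh deep copy of the model and re-binds the
    # same name; the previous copy should become garbage at this point.
    best_model = copy.deepcopy(model)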

Potentially deleting that might resolve the issue! Do you know of any way to delete it?

Nothing specific to PyTorch, I’m afraid :( del and gc.collect() don’t help.

I think Jupyter caches outputs in a way that keeps references to the results (see the sketch below).
That said, I have been using Jupyter with PyTorch for half a year now and have not run into the things you describe…
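Concretely, IPython keeps every displayed cell result in the Out dict and in the _, __ and ___ shortcuts, so a tensor you merely displayed stays referenced even after you delete your own name for it. A quick check and cleanup from inside a notebook (Out is the IPython built-in, so this will not run in a plain script):

import gc
import torch

# Out maps cell numbers to their displayed results; anything in here is
# still referenced and cannot be freed.
print({k: type(v) for k, v in Out.items()})

Out.clear()   # the "%reset out" magic does roughly the same
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()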

Best regards

Thomas

Thomas, please, could you check it on your machine? It will only take a few minutes :) Run this tutorial: http://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
and interrupt training several times after maybe 10 seconds. Does your memory consumption grow?

Tried a few other things. The problem appears to be connected to how Jupyter works. Would appreciate any ideas :)

Have you tried:
%whos
and then:
%reset -f
and finally:
%whos

I tried this. This doesn’t release the GPU memory.

But this works -

pkill -f ipykernel

This will kill your kernel too, so it will be like restarting the notebook.

%reset -f doesn’t help to release the memory, it just clears every variable :)

On the main page of the Jupyter notebook you can select the ipynb you’ve just run and click Shutdown at the top; then I think the memory should be released.

I know that I can shut down or restart the notebook to release the memory :) The question is how it can be done without restarting, and why it happens in the first place.

Have you tested whether the same happens if you run one full pass before the partial ones?
By design, PyTorch caches allocations.
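You can see the difference between memory held by live tensors and memory merely cached by the allocator with something like this (a sketch for a CUDA machine; memory_allocated and memory_reserved exist in newer PyTorch versions):

import torch

x = torch.randn(1024, 1024, device='cuda')
print(torch.cuda.memory_allocated())   # bytes occupied by live tensors
print(torch.cuda.memory_reserved())    # bytes held by the caching allocator

del x
print(torch.cuda.memory_allocated())   # drops back down
print(torch.cuda.memory_reserved())    # stays up until empty_cache() is called
torch.cuda.empty_cache()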

Best regards

Thomas

I now have the exact same issue:

cuda runtime error (2) : out of memory at /home/gpu/dev/rt/pytorch/torch/lib/THC/generic/THCStorage.cu:66

Thrown after repeated training runs during which I stopped the kernel and started training again.
Of course, PyTorch then returns False for torch.cuda.is_available().
Neither torch nor Jupyter is able to recover from this. The only solution for me was to restart the computer, since the memory is not released otherwise.

BTW I am using this command line for monitoring:
watch -n 0.1 'ps f -o user,pgrp,pid,pcpu,pmem,start,time,command -p `sudo lsof -n -w -t /dev/nvidia*`'
(source: https://stackoverflow.com/questions/8223811/top-command-for-gpus-using-cuda)

I’m having the same issue. I’m developing a new model, and naturally I get a lot of crashes while debugging. Every time I have to restart the kernel, run the pre-processing steps, … It’s a pain in the neck.
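One thing I have started doing is to catch the interrupt inside the cell so some cleanup runs before control returns to the notebook. I’m not sure it avoids the leak entirely; a sketch with placeholder names (train_one_epoch and the model stand in for your own code):

import gc
import torch
import torch.nn as nn

def train_one_epoch(model):
    # placeholder for the real training loop
    pass

model = nn.Linear(10, 10)
try:
    for epoch in range(100):
        train_one_epoch(model)
except KeyboardInterrupt:
    pass
finally:
    # Drop what the interrupted run was holding, then release the cache.
    del model
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()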