Memory is not released in Jupyter after KeyboardInterrupt

Every time I manually interrupt training, some memory remains stuck. For example, interrupting this tutorial - http://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html - during the training phase leaves about 500 MB occupied. The memory is not connected to any objects, and deleting everything in the notebook’s scope doesn’t release it. Interrupting training a second time adds the same amount of leaked memory to the “pool”. Restarting the kernel helps :)
Is there any way to release the memory or reset the graph similar to tf.reset_default_graph()?
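For reference, here is roughly what I tried before restarting (a minimal sketch, assuming a CUDA setup; the Linear module just stands in for whatever is left in the notebook's scope, and torch.cuda.empty_cache() is only there if your PyTorch version provides it):

import gc
import torch
import torch.nn as nn

# Stand-in for whatever objects are still referenced in the notebook.
model = nn.Linear(10, 10)
if torch.cuda.is_available():
    model = model.cuda()

# Drop the Python references that keep the tensors alive.
del model

gc.collect()                   # collect unreferenced objects and reference cycles
if torch.cuda.is_available():
    torch.cuda.empty_cache()   # hand cached, unused GPU blocks back to the driver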

Yes, I face the same problem with Jupyter. Restarting the kernel also does not do the trick. In fact, Jupyter starts a bunch of processes with sequential IDs, and they won’t even show up in your nvidia-smi output.

The only way I’ve found to locate them is to use

sudo fuser -v /dev/nvidia*

Then kill the processes by PID.
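For example (12345 is just a placeholder for whatever PID fuser printed):

sudo kill -9 12345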

OR just kill all ipython kernel stuff using

pkill -f ipykernel

I tested it on the CPU, so maybe that’s why the restart did help in my case.

Also, repeatedly deep-copying the model into the same variable leads to growing memory consumption.
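Roughly this pattern, as a toy sketch (the sizes and the loop are made up, not the real training code):

import copy
import torch.nn as nn

model = nn.Linear(1000, 1000)
best_model = None
for epoch in range(10):
    # Each iteration creates a fresh deep copy of the model and re-binds the
    # same name; the previous copy should become garbage at this point.
    best_model = copy.deepcopy(model)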

Potentially deleting that might resolve the issue! Do you know of any way to delete it?

Nothing specific to PyTorch, I’m afraid :( del and gc.collect() don’t help.

I think Jupyter caches outputs in a way that keeps references to the results (see the sketch below).
That said, I have been using Jupyter with PyTorch for half a year now and have not run into the things you describe…
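Concretely, IPython keeps every displayed cell result in the Out dict and in the _, __ and ___ shortcuts, so a tensor you merely displayed stays referenced even after you delete your own name for it. A quick check and cleanup from inside a notebook (Out is the IPython built-in, so this will not run in a plain script):

import gc
import torch

# Out maps cell numbers to their displayed results; anything in here is
# still referenced and cannot be freed.
print({k: type(v) for k, v in Out.items()})

Out.clear()   # the "%reset out" magic does roughly the same
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()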

Best regards

Thomas

Thomas, please, could you check it on your machine? It will only take a few minutes :) Run this tutorial: http://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
and interrupt training several times after maybe 10 seconds. Does your memory consumption grow?

Tried a few other things. The problem appears to be connected to how Jupyter works. Would appreciate any ideas :)

Have you tried:
%whos
and then:
%reset -f
and finally:
%whos

I tried this. This doesn’t release the GPU memory.

But this works -

pkill -f ipykernel

This will kill your kernel too, so it will be like restarting the notebook.

%reset -f doesn’t help to release the memory, it just clears every variable :)

On the main page of the Jupyter notebook you can select the ipynb you’ve just run and click Shutdown at the top; then I think the memory should be released.

I know that I can shut down or restart the notebook to release the memory :) The question is how it can be done without restarting, and why it happens in the first place.

Have you tested whether the same happens if you run one full pass before the partial ones?
By design, PyTorch caches allocations.
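You can see the difference between memory held by live tensors and memory merely cached by the allocator with something like this (a sketch for a CUDA machine; memory_allocated and memory_reserved exist in newer PyTorch versions):

import torch

x = torch.randn(1024, 1024, device='cuda')
print(torch.cuda.memory_allocated())   # bytes occupied by live tensors
print(torch.cuda.memory_reserved())    # bytes held by the caching allocator

del x
print(torch.cuda.memory_allocated())   # drops back down
print(torch.cuda.memory_reserved())    # stays up until empty_cache() is called
torch.cuda.empty_cache()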

Best regards

Thomas

I now have the exact same issue:

cuda runtime error (2) : out of memory at /home/gpu/dev/rt/pytorch/torch/lib/THC/generic/THCStorage.cu:66

Thrown after repeated training runs during which I stopped the kernel and started training again.
Of course, PyTorch then returns False for torch.cuda.is_available().
Neither torch nor Jupyter is able to recover from this. The only solution for me was to restart the computer, since the memory is not released otherwise.

BTW I am using this command line for monitoring:
watch -n 0.1 'ps f -o user,pgrp,pid,pcpu,pmem,start,time,command -p `sudo lsof -n -w -t /dev/nvidia*`'
(source: https://stackoverflow.com/questions/8223811/top-command-for-gpus-using-cuda)

I’m having the same issue. I’m developing a new model, and naturally I get a lot of crashes while debugging. Every time I have to restart the kernel, run the pre-processing steps, … It’s a pain in the neck.
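One thing I have started doing is to catch the interrupt inside the cell so some cleanup runs before control returns to the notebook. I’m not sure it avoids the leak entirely; a sketch with placeholder names (train_one_epoch and the model stand in for your own code):

import gc
import torch
import torch.nn as nn

def train_one_epoch(model):
    # placeholder for the real training loop
    pass

model = nn.Linear(10, 10)
try:
    for epoch in range(100):
        train_one_epoch(model)
except KeyboardInterrupt:
    pass
finally:
    # Drop what the interrupted run was holding, then release the cache.
    del model
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()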