A guide to recovering from CUDA Out of Memory and other exceptions

This thread explains and helps sort out situations where an exception in a Jupyter notebook leaves the user unable to do anything else without restarting the kernel and re-running the notebook from scratch. This usually happens with a CUDA Out of Memory exception, but it can happen with any exception.

The problem comes from IPython, which keeps a reference to the exception's traceback; the traceback in turn holds each frame's locals(), and thus prevents general and GPU memory from being released.
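The mechanism can be demonstrated with plain Python, no GPU required. This is a minimal sketch: `BigObject` stands in for a large tensor, and storing the result of `sys.exc_info()` simulates IPython holding on to the traceback.

```python
import gc
import sys
import weakref

class BigObject:
    """Stands in for a large tensor that should be freed promptly."""

probe = None  # weak reference used to observe whether the object is alive

def work():
    global probe
    big = BigObject()
    probe = weakref.ref(big)
    raise RuntimeError("simulated CUDA OOM")

try:
    work()
except RuntimeError:
    stored = sys.exc_info()  # simulate IPython keeping the traceback around

# `big` is still alive: the stored traceback references work()'s frame,
# and that frame still holds the local variable.
assert probe() is not None

del stored   # drop the traceback reference (what the reactive fix achieves)
gc.collect()
assert probe() is None  # the frame and its locals have now been freed
```

As long as the traceback is referenced anywhere, every local in every frame it records stays alive, which is exactly why a large tensor survives the exception.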

Currently there are two solutions to this problem:

  1. stripping the traceback of its frame locals before the exception reaches IPython (preemptive)
  2. raising a second, trivial exception in the next cell, such as 1/0, which replaces the stored traceback (reactive)
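The preemptive approach can be sketched as a decorator built on the stdlib's `traceback.clear_frames()`, which drops the local variables of every frame recorded in a traceback (frames that are still executing are silently skipped). The decorator name `clear_tb_on_error` is chosen here for illustration and is not any library's API.

```python
import functools
import traceback

def clear_tb_on_error(func):
    """Sketch of the preemptive approach: strip frame locals from the
    traceback before the exception reaches IPython, so the stored
    traceback cannot pin large objects in memory."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            # clear_frames() drops each recorded frame's local variables;
            # still-executing frames (like this wrapper's) are skipped
            traceback.clear_frames(e.__traceback__)
            raise
    return wrapper
```

The trade-off: because the frame locals are gone, post-mortem inspection with the %debug magic no longer has anything useful to show, which is precisely the difficulty mentioned below.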

There will be better solutions once IPython sorts this out. The difficulty is continuing to support the %debug magic. You can also follow the discussion here.
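In practice, the reactive approach (option 2 above) amounts to running something like the following right after the failure. The `1/0` must be evaluated uncaught in its own notebook cell so that IPython stores the new, trivial traceback in place of the old one; it is wrapped here only so the sketch runs as a plain script. The `torch` step applies only if you use PyTorch.

```python
import gc

# Step 1: in its own cell, raise a trivial exception to overwrite the
# stored traceback, e.g. evaluate:  1/0
# (wrapped here only so this sketch runs outside a notebook)
try:
    1 / 0
except ZeroDivisionError:
    pass

# Step 2: release the now-unreferenced objects and, if PyTorch is
# installed, return cached GPU memory to the driver
gc.collect()
try:
    import torch
    torch.cuda.empty_cache()  # no-op when CUDA was never initialized
except ImportError:
    pass
```

After this, the GPU memory held by the failed cell's locals should be available again without a kernel restart.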

Please read the guide https://docs.fast.ai/troubleshoot.html#memory-leakage-on-exception, which explains the problem in detail and provides concrete solutions. If the fastai-specific section isn't relevant to you, you can skip it and just read the introduction and the custom-solutions sections. If after reading the guide you have any questions or difficulties applying the information, please ask them in this dedicated thread.

I will update this post once we have a resolution from the IPython dev team (which could take a while).