Memory Leak Debugging and Common Causes

Just wanted to make a thread with some information I wish I had found before spending 4 hours trying to debug a memory leak. Most of the memory leak threads I found were unhelpful, so I wanted to throw together a few tips here.

  1. Causes of leaks:
    i) Most threads talk about leaks caused by creating a list that holds tensors: if you keep appending tensors to it, you will at some point fill up the memory.
    ii) Something I didn’t see mentioned is autograd leaks, i.e. if you do a computation with a tensor that requires grad and store the result somewhere that never gets backpropagated, the computational graph is never freed and just keeps growing and growing. In my case I was measuring solution sparsity with a penalty function that was never used for backprop. I was then calculating an exponential running average of it, which is why, even after the penalty itself got garbage collected, the computational graph behind the average remained. This issue can be avoided by calling .detach() on any tensor whose computation isn’t strictly for training the network.
  2. torch.cuda.empty_cache() is (in most cases) nothing more than a band-aid. It’s not going to fix the underlying issue, though it may delay the error for a while by releasing unused cached memory while ignoring the actual problem.
  3. The most useful way I found to debug is to use torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated() to print memory usage (current allocation as a fraction of the peak so far) at the top of the training loop. Then look at your training loop, add a continue statement right below the first line, and run it. If your memory usage holds steady, move the continue to the next line, and so on until you find the leak.
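To make cause 1.ii) concrete, here is a minimal sketch of the running-average situation. It runs on CPU; the `penalty` function, the tensor `w` and the 0.9/0.1 smoothing factors are made up for illustration and are not the original training code.

```python
import torch

w = torch.randn(10, requires_grad=True)  # stand-in for model parameters

def penalty(x):
    # Hypothetical sparsity measure: only monitored, never backpropagated.
    return x.abs().sum()

leaky_avg = torch.tensor(0.0)
safe_avg = torch.tensor(0.0)
for step in range(100):
    p = penalty(w * torch.randn(10))
    # Leaky: the average inherits a grad_fn, so every step's graph stays reachable.
    leaky_avg = 0.9 * leaky_avg + 0.1 * p
    # Safe: detach() cuts the link to the graph before the value is stored.
    safe_avg = 0.9 * safe_avg + 0.1 * p.detach()

print("leaky average still has a graph:", leaky_avg.grad_fn is not None)
print("detached average has a graph:", safe_avg.grad_fn is not None)
```

The leaky version chains 100 steps of graph behind a single scalar, which is why deleting `p` each step does not help.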

Happy leak hunting!


Thanks a lot. A clearer title would help a lot, IMHO, something like “How to find and fix a possible memory leak” or “What I found helpful in fixing a memory leak”.
Anyway, I enjoyed this, and thank you for it.


Another one, a mix between 1.i) and 1.ii): if you append tensors that require grad (e.g. the loss) to Python lists for tracking purposes, their autograd graphs are kept alive along with them, and memory grows a bit more than expected!
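A small sketch of that case, assuming a toy `torch.nn.Linear` model and made-up list names: storing the loss tensor keeps its graph reachable, while storing `loss.item()` keeps only a Python float.

```python
import torch

model = torch.nn.Linear(4, 1)  # toy model, stands in for the real network
history_leaky, history_safe = [], []

for step in range(5):
    x = torch.randn(8, 4)
    loss = model(x).pow(2).mean()
    history_leaky.append(loss)        # tensor with grad_fn: its graph stays reachable
    history_safe.append(loss.item())  # plain Python float, no graph attached
    loss.backward()
    model.zero_grad()

print(type(history_leaky[0]), history_leaky[0].grad_fn is not None)
print(type(history_safe[0]))
```

If you need the value to stay a tensor, `loss.detach()` is the other usual option; `.item()` also forces a GPU synchronization, so it is not free in a tight loop.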

Also, leaks can find their way into host memory (RAM, not GPU memory), so it can be useful to log RAM usage during training as well.


How does one log RAM usage during training? Does gc also include RAM usage? For instance, does the following code correctly log RAM usage?

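    import gc
    import torch

    # Print every tensor (or object wrapping one in .data) that gc still tracks.
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
                print(type(obj), obj.size())
        except Exception:
            # Some objects raise on attribute access; skip them.
            pass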

I don’t know about gc, but here’s what I’ve used: psutil.virtual_memory().percent. That one reports the percentage of RAM in use; you can use other metrics too, see the doc here.
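For reference, a minimal sketch of RAM logging with psutil (a third-party package, assumed to be installed). The helper name `ram_report` is made up; it returns the system-wide used percentage together with the resident memory of the current process, which is often the more telling number when hunting a leak.

```python
import os
import psutil

def ram_report():
    """Return (system RAM used in percent, this process's resident memory in MiB)."""
    system_percent = psutil.virtual_memory().percent
    process_mib = psutil.Process(os.getpid()).memory_info().rss / 2**20
    return system_percent, process_mib

system_percent, process_mib = ram_report()
print(f"system RAM used: {system_percent:.1f}%  process RSS: {process_mib:.1f} MiB")
```

Calling this once per epoch (or every N steps) and watching whether the process number keeps climbing is usually enough.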


I’m having trouble finding my memory leak, and I’m trying your 3rd tip, which is placing a continue after each line and checking. I have a small question about it: if we continue right after a forward call, should the memory consumption stay constant? Here is my code:

    y_pred, y_est = model[model_id](x)
    print(torch.cuda.memory_allocated() / torch.cuda.max_memory_allocated())
    continue

The forward call is the first thing in the training loop, and the memory starts to explode. Is this expected or does this mean the leak is likely inside the call? Thank you.

Yeah, the goal is to just isolate each line individually until you find the part with the memory leak. If you put the continue above that line without issue, but below it there’s a leak, then that line is your problem. If I were to guess, this looks like an autograd memory leak, i.e. PyTorch is storing each calculation step so it can calculate the gradient of the loss, but if something keeps holding on to those results, it just keeps a record of all the calculations.
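Here is a rough sketch of the whole procedure on a toy loop. The model, the data and the helper `gpu_mem_report` are placeholders, not the code from the question. The helper prints raw MiB values instead of the ratio, which may be easier to read, and returns None on a machine without CUDA.

```python
import torch

def gpu_mem_report():
    """Return (current MiB, peak MiB) allocated on the GPU, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    current = torch.cuda.memory_allocated() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    return current, peak

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 1).to(device)  # toy stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(20):
    print(step, gpu_mem_report())  # watch whether the current value keeps growing
    x = torch.randn(32, 16, device=device)
    y_pred = model(x)
    # continue  # <- uncomment; if memory holds steady, move it one line down
    loss = y_pred.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The first line where moving the continue past it makes the numbers climb is the one to look at.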

Try wrapping your forward call in a “with torch.no_grad():” block to check if that’s the issue.
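A minimal sketch of that check, using a toy model rather than the code from the question. Keep in mind it is a diagnostic only: outputs produced under torch.no_grad() have no grad_fn, so calling backward() on a loss built from them raises a RuntimeError. Skip the backward/optimizer step while testing.

```python
import torch

model = torch.nn.Linear(16, 1)  # toy stand-in for the real model
x = torch.randn(32, 16)

with torch.no_grad():
    y_no_grad = model(x)  # no graph is recorded for this forward pass

y_grad = model(x)  # normal forward pass, graph recorded

print("no_grad output requires grad:", y_no_grad.requires_grad)
print("normal output has grad_fn:", y_grad.grad_fn is not None)

try:
    y_no_grad.sum().backward()  # nothing to backpropagate through
except RuntimeError as err:
    print("backward on a no_grad output fails:", err)
```

If memory stops growing with the forward pass wrapped like this, the leak is in graph-related state being held somewhere rather than in the raw activations.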

Thanks for the prompt reply, but when I run with the wrapper torch.no_grad(), this error occurs:

      File "main_pred.py", line 145, in <module>
        train_res = train_model(train_loader, optim, epoch, args.epochs, writer, model, args, weight_balancing, device)
      File "/home/chris/CSD_graph_detection/modules/utils.py", line 321, in train_model
        return eval_model(loader, optim, epoch, epochs, writer, model, args, weight_balancing, device, True)
      File "/home/chris/CSD_graph_detection/modules/utils.py", line 228, in eval_model
        loss.backward()
      File "/home/chris/anaconda3/envs/CSD/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "/home/chris/anaconda3/envs/CSD/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
        allow_unreachable=True)  # allow_unreachable flag
    RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Do you have any idea?

Thanks a lot for the tips, Charles! It never occurred to me that the computational graph was occupying the memory, thanks for the reminder!

Thank you very much for this useful summary.