Just wanted to make a thread with some information I wish I had found before spending 4 hours trying to debug a memory leak. Most of the memory leak threads I came across were unhelpful, so I wanted to throw together a few tips here.
Causes of leaks:
i) Most threads talk about leaks caused by keeping a list that holds tensors: if you continually append tensors to it, you will eventually fill up memory.
ii) Something I didn't see mentioned is autograd leaks, i.e. if you do a computation with a tensor and store the result somewhere that never gets back-propped, the computational graph is never cleared and just keeps growing. In my case I was measuring solution sparsity with a penalty function that was never used for backprop, and I was then computing an exponential running average of it, which is why even after the penalty tensor got garbage collected, the computational graph behind the running average remained. This can be avoided by calling .detach() on any tensor computation that isn't strictly for training the network (a minimal sketch of the pattern is below).
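To make ii) concrete, here is a minimal sketch of the pattern I mean (the model and names here are purely illustrative, not my actual code):

import torch

model = torch.nn.Linear(10, 1)
running_sparsity = torch.tensor(0.0)

for step in range(1000):
    out = model(torch.randn(32, 10))

    # penalty is only used for monitoring, never for loss.backward()
    penalty = out.abs().mean()

    # leaks: every update chains the new graph onto the old one, so the graph
    # behind running_sparsity grows by one forward pass per step
    running_sparsity = 0.99 * running_sparsity + 0.01 * penalty

    # fix: running_sparsity = 0.99 * running_sparsity + 0.01 * penalty.detach()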
torch.cuda.empty_cache() is (in most cases) nothing more than a band-aid; it's not going to fix the underlying issue, though it may delay the out-of-memory error for a while by freeing cached memory while ignoring the actual problem.
The most useful debugging method I found is to use torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated() to print the percentage of memory used at the top of the training loop. Then add a continue statement right below the first line of the loop body and run training. If memory usage holds steady, move the continue down to the next line and repeat until you find the line that leaks (see the sketch below).
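A rough sketch of the pattern (model, criterion, optimizer and train_loader are placeholders for your own objects):

for batch in train_loader:
    used = torch.cuda.memory_allocated() / max(torch.cuda.max_memory_allocated(), 1) * 100
    print(f"cuda memory used: {used:.2f}%")

    x, y = batch
    x, y = x.cuda(), y.cuda()
    continue  # move this continue down one statement at a time between runs

    out = model(x)
    loss = criterion(out, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

If the printed percentage holds steady with the continue in one position but starts climbing once you move it below a particular line, that line is where the leak is.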
Thanks a lot. A clearer title would help a lot IMHO, something like "How to find and fix a possible memory leak" or "What I found helpful in fixing a memory leak".
Anyway, I enjoyed this, thank you.
Another one, a mix between i) and ii): if you append tensors that still carry gradients to Python lists for tracking purposes, their whole computational graphs are kept alive along with them, and the list grows a lot more than expected! (Example below.)
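The classic version of this is collecting losses for logging (losses, loader, model, criterion and optimizer are just illustrative names):

losses = []
for x, y in loader:
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    losses.append(loss)  # keeps the whole graph of every step alive
    # better: losses.append(loss.item())  # or loss.detach()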
Also, leaks can find their way into main memory (RAM, not GPU memory), so it can be useful to log RAM usage during training as well.
How does one log RAM usage during training? Does gc also include RAM usage? For instance, does the following code correctly log RAM usage?
import gc
import torch

for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except Exception:
        pass
I don’t know about gc, but here’s what I’ve used: psutil.virtual_memory().percent. You can use other metrics than the used percentage; see the psutil documentation.
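For example, something like this (assuming psutil is installed; log_memory is just a helper name I made up):

import psutil
import torch

def log_memory(step):
    ram_pct = psutil.virtual_memory().percent             # system RAM usage
    gpu_mib = torch.cuda.memory_allocated() / 1024 ** 2   # MiB allocated by PyTorch
    print(f"step {step}: RAM {ram_pct:.1f}% | GPU {gpu_mib:.1f} MiB")

Call log_memory(step) at the top of the training loop and watch whether either number keeps growing across iterations.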
I’m having trouble finding my memory leak, and I’m trying your third tip of moving a continue down line by line and checking. I have a small question about it: if we continue right after the forward call, should the memory consumption stay constant?
The forward call is the first thing in my training loop, and the memory starts to explode. Is this expected, or does it mean the leak is likely inside the call? Thank you.
Yeah, the goal is to isolate each line individually until you find the one causing the memory leak: if the continue above a line shows no issue but moving it below that line leaks, then that line is your problem. If I were to guess, this looks like an autograd memory leak, i.e. PyTorch stores each calculation step so it can compute the gradient of the loss, but if you never actually do the backward/optimizer step, it just keeps accumulating a record of all calculations.
Try wrapping your forward call in a with torch.no_grad(): block to check if that’s the issue, like this:
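(model and inputs here being whatever you use in your loop; this is just for debugging and should be removed once you’ve found the leak):

with torch.no_grad():
    output = model(inputs)  # no computational graph is recorded inside this block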
Thanks for the prompt reply, but when I run with the wrapper torch.no_grad(), this error occurs:
File "main_pred.py", line 145, in <module>
train_res = train_model(train_loader, optim, epoch, args.epochs, writer, model, args, weight_balancing, device)
File "/home/chris/CSD_graph_detection/modules/utils.py", line 321, in train_model
return eval_model(loader, optim, epoch, epochs, writer, model, args, weight_balancing, device, True)
File "/home/chris/CSD_graph_detection/modules/utils.py", line 228, in eval_model
loss.backward()
File "/home/chris/anaconda3/envs/CSD/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/chris/anaconda3/envs/CSD/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Hi, I have the same problem as ii). When I use save_tensor, some layers run forward but are never back-propped, and the memory leaks. I can’t use .detach(). Do you have any way to solve this problem?
Thank you for this thread. I was having issues with my training step because the model would occupy my entire RAM and then just freeze mid-training. After reading your thread and looking carefully at my code, I noticed my custom loss function wasn’t calling .detach() on the tensors it created, and that was what was freezing everything!
This thread was super useful in spotting my memory leak. Based on Charles’s suggestion, I made a class that attempts to spot the position of the memory leak automatically:
import torch


class LeakFinder:

    def __init__(self):
        self.step = 0            # used to keep track of the step within the batch
        self.batch = 0           # used to keep track of the batch
        self.values = {}
        self.predict_every = 20  # how often to predict the leak position
        self.verbose = True      # print the predicted leak position

    def set_batch(self, epoch):
        """
        Set the batch number
        """
        self.batch = epoch
        self.step = 0
        self.values[epoch] = {}

    def get_cuda_perc(self):
        # get the percentage of cuda memory used
        perc = torch.cuda.memory_allocated() / torch.cuda.max_memory_allocated()
        self.values[self.batch][self.step] = perc * 100
        self.step += 1

    def predict_leak_position(self, diffs, per_epoch_remainder):
        # train a tree regressor to predict the per epoch increase
        from sklearn.tree import DecisionTreeRegressor
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import mean_squared_error
        from sklearn.preprocessing import MinMaxScaler

        # insert a zero at the start of per_epoch_remainder
        per_epoch_remainder = torch.cat([torch.tensor([0.0]), per_epoch_remainder])

        # scale the data to be between 0 and 1
        x_scaler = MinMaxScaler()
        diffs = x_scaler.fit_transform(diffs)

        y_scaler = MinMaxScaler()
        per_epoch_remainder = y_scaler.fit_transform(per_epoch_remainder.reshape(-1, 1))

        # train test split
        X_train, X_test, y_train, y_test = train_test_split(
            diffs, per_epoch_remainder, test_size=0.1, random_state=42)

        # train regressor
        regressor = DecisionTreeRegressor(random_state=0)
        regressor.fit(X_train, y_train)

        # predict
        y_pred = regressor.predict(X_test)

        # calculate error
        mse = mean_squared_error(y_test, y_pred)
        mag = mse / per_epoch_remainder.mean() * 100
        print(f"MSE: {mse} ({mag:.2f}%)")

        # find the most important feature (i.e. the step whose per-step diff best
        # explains the per-epoch increase)
        feature_importance = regressor.feature_importances_
        most_important_feature = torch.argmax(torch.tensor(feature_importance)).item()
        print(f"Likely leak position between step {most_important_feature} and step {most_important_feature + 1}")

    def find_leaks(self):
        """
        Find leaks in the training loop
        """
        if self.batch < 2:
            return

        if not self.verbose and self.batch % self.predict_every != 0:
            return

        # estimate per step diff
        diffs = []
        for epoch, values in self.values.items():
            dif = []
            for step in range(1, len(values)):
                dif += [values[step] - values[step - 1]]
            diffs.append(dif)

        lens = [len(x) for x in diffs]
        min_lens = min(lens)

        per_epoch_increase = [self.values[epoch][min_lens - 1] - self.values[epoch][0]
                              for epoch in self.values.keys() if epoch > 0]
        between_epoch_decrease = [self.values[epoch][0] - self.values[epoch - 1][min_lens - 1]
                                  for epoch in self.values.keys() if epoch > 0]

        per_epoch_increase = torch.tensor(per_epoch_increase)
        between_epoch_decrease = torch.tensor(between_epoch_decrease)
        per_epoch_remainder = per_epoch_increase + between_epoch_decrease

        per_epoch_increase_mean = per_epoch_remainder.mean()
        per_epoch_increase_sum = per_epoch_remainder.sum()

        # truncate every row to the same length so the tensor is rectangular
        diffs = torch.tensor([d[:min_lens] for d in diffs])

        print(
            f"Per epoch increase: {per_epoch_increase_mean:.2f}% cuda memory "
            f"(total increase of {per_epoch_increase_sum:.2f}%) currently at "
            f"{self.values[self.batch][min_lens - 1]:.2f}% cuda memory")

        if self.batch % self.predict_every == 0:
            self.predict_leak_position(diffs, per_epoch_remainder)
You can put it in your training loop as such:
leakfinder = LeakFinder()

for batch_idx, batch in enumerate(dataset):
    leakfinder.set_batch(batch_idx)  # pass the batch index, not the batch itself

    # do stuff
    leakfinder.get_cuda_perc()

    # do more stuff
    leakfinder.get_cuda_perc()

    # do even more stuff
    leakfinder.get_cuda_perc()

    # find leaks
    leakfinder.find_leaks()