Memory Leak Debugging and Common Causes

Just wanted to make a thread with some information I wish I had found before spending four hours debugging a memory leak. Most of the memory-leak threads I found were unhelpful, so I wanted to collect a few tips here.

  1. Causes of leaks:
    i) Most threads talk about leaks caused by keeping an array (e.g. a Python list) that holds tensors: if you keep appending tensors to it, you will eventually fill up memory.
    ii) Something I didn't see mentioned is autograd leaks, i.e. if you do a computation with a tensor and store the result somewhere that never gets back-propagated, the computational graph is never freed and just keeps growing. In my case I was measuring solution sparsity with a penalty term that was never used for backprop, and then taking an exponential running average of it; even after the penalty tensor itself went out of scope, the computational graph behind the running average remained. This can be avoided by calling .detach() on any tensor computation that isn't strictly for training the network (see the sketch right after this list).
  2. torch.cuda.empty_cache() is (in most cases) nothing more than a band-aid: it won't fix the underlying issue, though it may delay the out-of-memory error for a while by freeing cached blocks while the actual problem keeps growing.
  3. The most useful way I found to debug is to use torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated() to print the percentage of used memory at the top of the training loop. Then add a continue statement right below the first line of the loop and run it. If memory usage holds steady, move the continue down one line, and so on, until you find the line that leaks.
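
To make 1.ii concrete, here's a minimal sketch of the failure mode (the model and the running-average variable are made up for illustration, and it assumes a GPU is available): a metric that is never back-propagated keeps its graph alive until you detach it.

import torch

model = torch.nn.Linear(10, 1).cuda()
running_sparsity = torch.zeros(1, device="cuda")

for step in range(1000):
    x = torch.randn(32, 10, device="cuda")
    out = model(x)

    # penalty is only a metric, it never goes through loss.backward()
    penalty = out.abs().mean()

    # Leaks: penalty carries a grad_fn, so the running average keeps a
    # reference to every graph built so far and autograd can never free them
    # running_sparsity = 0.99 * running_sparsity + 0.01 * penalty

    # Fix: detach (or use .item()) anything that is only for logging
    running_sparsity = 0.99 * running_sparsity + 0.01 * penalty.detach()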

happy leak hunting


Thanks a lot. Having a clearer title would help a lot IMHO, something like “How to find and fix a possible memory leak” or “What I found helpful in fixing a memory leak”.
Anyway, I enjoyed this, thank you.


Another one, a mix between 1.i) and 1.ii): if you append tensors that still have their computation graph attached to Python lists for tracking purposes, that autograd history is kept alive along with each tensor, so the list grows quite a bit more than expected! (See the sketch below.)
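
A rough sketch of that pattern, with made-up names: appending the raw loss tensor drags its autograd history along with it, while .item() (or .detach()) stores just the value.

import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_history = []

for step in range(100):
    x = torch.randn(8, 4)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Grows more than expected: each appended tensor keeps its autograd
    # history alive
    # loss_history.append(loss)

    # Fine: stores a plain Python float
    loss_history.append(loss.item())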

Also, leaks can find their way into CPU memory (RAM, not GPU memory), so it can be useful to log RAM usage during training as well.


How does one log RAM usage during training? Does gc also include RAM usage? For instance, does the following code correctly log RAM usage?

    import gc
    import torch

    # iterate over every object tracked by the garbage collector and
    # print the type and shape of each tensor found
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
                print(type(obj), obj.size())
        except Exception:
            pass

I don’t know about gc, but here’s what I’ve used: psutil.virtual_memory().percent. You can use other metrics than the used percentage; see the psutil docs.
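
For example, a rough sketch (psutil has to be installed separately, and the helper name is made up); you can call it once per batch next to the CUDA prints and compare the two curves:

import psutil

def log_ram_usage(tag=""):
    # system-wide RAM usage as reported by the OS
    vm = psutil.virtual_memory()
    print(f"{tag} RAM: {vm.percent:.1f}% used "
          f"({vm.used / 1e9:.2f} / {vm.total / 1e9:.2f} GB)")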


I’m having trouble finding my memory leak, and I’m trying your third tip of moving a continue down the training loop line by line and checking memory. I have a small question about it: if we continue right after a forward call, should the memory consumption stay constant? Here is my code:

y_pred, y_est = model[model_id](x)
print(torch.cuda.memory_allocated() / torch.cuda.max_memory_allocated())
continue

The forward call is the first thing in the training loop, and the memory starts to explode. Is this expected or does this mean the leak is likely inside the call? Thank you.

Yeah, the goal is to isolate each line individually until you find the part with the memory leak. If putting the continue above a line shows no issue, but putting it below that line leaks, then that line is your problem. If I had to guess, this looks like an autograd memory leak, i.e. PyTorch is storing each calculation step so it can compute the gradient of the loss, but if you never actually take the gradient step, it just keeps accumulating a record of all calculations.

Try wrapping your forward call in a with torch.no_grad(): block to check whether that's the issue.
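
Something like this, purely as a diagnostic while bisecting (outputs produced under no_grad can't be back-propagated, so don't keep it in the real training step):

with torch.no_grad():
    y_pred, y_est = model[model_id](x)
print(torch.cuda.memory_allocated() / torch.cuda.max_memory_allocated())
continue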

Thanks for the prompt reply, but when I wrap the forward call in torch.no_grad(), this error occurs:

File "main_pred.py", line 145, in <module>
    train_res = train_model(train_loader, optim, epoch, args.epochs, writer, model, args, weight_balancing, device)
  File "/home/chris/CSD_graph_detection/modules/utils.py", line 321, in train_model
    return eval_model(loader, optim, epoch, epochs, writer, model, args, weight_balancing, device, True)
  File "/home/chris/CSD_graph_detection/modules/utils.py", line 228, in eval_model
    loss.backward()
  File "/home/chris/anaconda3/envs/CSD/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/chris/anaconda3/envs/CSD/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Do you have any idea?

Thanks a lot for the tips, Charles! It never occurred to me that the computational graph was occupying the memory, thanks for the reminder!

Thank you very much for this useful summary.

Hi, I have the same problem as ii). When I use save_tensor, some layers run forward but never backward, and the memory leaks. I can't use .detach(). Is there any way to solve this?

This is so helpful!!! Using max_memory_allocated() to debug this. Saved me hours of time!!!

Thank you for this thread. I was having issues with my training step because the model would occupy my entire RAM and just freeze mid-training. After reading your thread and looking carefully at my code, I noticed my custom loss function wasn’t calling .detach() on the tensors it was creating, and that was what froze everything!

This thread was super useful in spotting my memory leak. Based on Charles’s suggestion, I made a class that attempts to spot the position of the memory leak automatically:


import torch


class LeakFinder:

    def __init__(self):
        self.step = 0  # used to keep track of the step within the batch
        self.batch = 0  # used to keep track of the batch index
        self.values = {}  # {batch: {step: % of cuda memory allocated}}
        self.predict_every = 20  # how often to run the leak-position prediction
        self.verbose = True  # print the per-epoch summary every batch

    def set_batch(self, epoch):
        """
        Set the batch number
        """
        self.batch = epoch
        self.step = 0
        self.values[epoch] = {}

    def get_cuda_perc(self):

        # record the fraction of CUDA memory currently allocated, as a percentage
        max_mem = torch.cuda.max_memory_allocated()
        perc = torch.cuda.memory_allocated() / max_mem if max_mem > 0 else 0.0
        self.values[self.batch][self.step] = perc * 100

        self.step += 1

    def predict_leak_position(self, diffs, per_epoch_remainder):
        # train a tree regressor to predict the per epoch increase
        from sklearn.tree import DecisionTreeRegressor
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import mean_squared_error
        from sklearn.preprocessing import MinMaxScaler

        # prepend a zero so per_epoch_remainder lines up with the rows of diffs
        per_epoch_remainder = torch.cat([torch.tensor([0.0]), per_epoch_remainder])

        # scale the data to be between 0 and 1
        x_scaler = MinMaxScaler()
        diffs = x_scaler.fit_transform(diffs)

        y_scaler = MinMaxScaler()
        per_epoch_remainder = y_scaler.fit_transform(per_epoch_remainder.reshape(-1, 1))

        # train test split
        X_train, X_test, y_train, y_test = train_test_split(diffs, per_epoch_remainder, test_size=0.1, random_state=42)

        # train regressor
        regressor = DecisionTreeRegressor(random_state=0)
        regressor.fit(X_train, y_train)

        # predict
        y_pred = regressor.predict(X_test)

        # calculate error
        mse = mean_squared_error(y_test, y_pred)
        mag = mse / per_epoch_remainder.mean() * 100
        print(f"MSE: {mse} ({mag:.2f}%)")

        # find the most important feature
        feature_importance = regressor.feature_importances_
        most_important_feature = torch.argmax(torch.tensor(feature_importance))
        print(f"Likely leak position between step {most_important_feature} and step {most_important_feature + 1}")

    def find_leaks(self):
        """
        Find leaks in the training loop
        """

        if self.batch < 2:
            return

        if not self.verbose and self.batch % self.predict_every != 0:
            return

        # estimate per step diff
        diffs = []
        for epoch, values in self.values.items():
            dif = []
            for step in range(1, len(values)):
                dif += [values[step] - values[step - 1]]
            diffs.append(dif)

        lens = [len(x) for x in diffs]
        min_lens = min(lens)

        per_epoch_increase = [self.values[epoch][min_lens - 1] - self.values[epoch][0] for epoch in self.values.keys()
                              if epoch > 0]
        between_epoch_decrease = [self.values[epoch][0] - self.values[epoch - 1][min_lens - 1] for epoch in
                                  self.values.keys() if epoch > 0]
        per_epoch_increase = torch.tensor(per_epoch_increase)
        between_epoch_decrease = torch.tensor(between_epoch_decrease)

        per_epoch_remainder = per_epoch_increase + between_epoch_decrease

        per_epoch_increase_mean = per_epoch_remainder.mean()
        per_epoch_increase_sum = per_epoch_remainder.sum()

        # truncate every epoch to the common number of steps so the tensor is rectangular
        diffs = torch.tensor([d[:min_lens] for d in diffs])

        print(
            f"Per epoch increase: {per_epoch_increase_mean:.2f}% cuda memory "
            f"(total increase of {per_epoch_increase_sum:.2f}%) currently at "
            f"{self.values[self.batch][min_lens - 1]:.2f}% cuda memory")

        if self.batch % self.predict_every == 0:
            self.predict_leak_position(diffs, per_epoch_remainder)

You can use it in your training loop like this:


leakfinder = LeakFinder()
for batch_idx, batch in enumerate(dataset):
    # set_batch expects an integer index, not the batch itself
    leakfinder.set_batch(batch_idx)

    # do stuff
    leakfinder.get_cuda_perc()

    # do more stuff
    leakfinder.get_cuda_perc()

    # do even more stuff
    leakfinder.get_cuda_perc()

    # find leaks
    leakfinder.find_leaks()