Just wanted to make a thread with some information I wish I had found before spending 4 hours trying to debug a memory leak. Most of the memory leak threads I came across were unhelpful, so I wanted to throw together a few tips here.
Causes of leaks:
i) Most threads talk about leaks caused by keeping a list that holds tensors: if you continually append tensors to it, you will eventually fill up memory.
ii) Something I didn't see mentioned is autograd leaks, i.e. if you do a computation with a tensor and store the result somewhere that never gets back-propped, the computational graph is never cleared and just keeps growing. In my case I was measuring solution sparsity with a penalty function that was never used for backprop, and I was then computing an exponential running average of it, which is why even after the penalty tensor got garbage collected, the computational graph behind the running average remained. This can be avoided by calling .detach() on any tensor computation that isn't strictly for training the network (a minimal sketch of the pattern is below).
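To make ii) concrete, here is a minimal sketch of the pattern I mean (the model and names here are purely illustrative, not my actual code):

import torch

model = torch.nn.Linear(10, 1)
running_sparsity = torch.tensor(0.0)

for step in range(1000):
    out = model(torch.randn(32, 10))

    # penalty is only used for monitoring, never for loss.backward()
    penalty = out.abs().mean()

    # leaks: every update chains the new graph onto the old one, so the graph
    # behind running_sparsity grows by one forward pass per step
    running_sparsity = 0.99 * running_sparsity + 0.01 * penalty

    # fix: running_sparsity = 0.99 * running_sparsity + 0.01 * penalty.detach()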
torch.cuda.empty_cache() is (in most cases) nothing more than a band-aid; it's not going to fix the underlying issue, though it may delay the out-of-memory error for a while by freeing cached memory while ignoring the actual problem.
The most useful debugging method I found is to use torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated() to print the percentage of memory used at the top of the training loop. Then add a continue statement right below the first line of the loop body and run training. If memory usage holds steady, move the continue down to the next line and repeat until you find the line that leaks (see the sketch below).
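A rough sketch of the pattern (model, criterion, optimizer and train_loader are placeholders for your own objects):

for batch in train_loader:
    used = torch.cuda.memory_allocated() / max(torch.cuda.max_memory_allocated(), 1) * 100
    print(f"cuda memory used: {used:.2f}%")

    x, y = batch
    x, y = x.cuda(), y.cuda()
    continue  # move this continue down one statement at a time between runs

    out = model(x)
    loss = criterion(out, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

If the printed percentage holds steady with the continue in one position but starts climbing once you move it below a particular line, that line is where the leak is.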
Thanks a lot. A clearer title would help a lot IMHO, something like "How to find and fix a possible memory leak" or "What I found helpful in fixing a memory leak".
Anyway, I enjoyed this, thank you.
Another one, a mix between i) and ii): if you append tensors that still carry gradients to Python lists for tracking purposes, their whole computational graphs are kept alive along with them, and the list grows a lot more than expected! (Example below.)
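The classic version of this is collecting losses for logging (losses, loader, model, criterion and optimizer are just illustrative names):

losses = []
for x, y in loader:
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    losses.append(loss)  # keeps the whole graph of every step alive
    # better: losses.append(loss.item())  # or loss.detach()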
Also, leaks can find their way into main memory (RAM, not GPU memory), so it can be useful to log RAM usage during training as well.
How does one log RAM usage during training? Does gc also include RAM usage? For instance, does the following code correctly log RAM usage?
import gc
import torch

for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except Exception:
        pass
I don’t know about gc, but here’s what I’ve used: psutil.virtual_memory().percent. You can use other metrics than the used percentage; see the psutil documentation.
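For example, something like this (assuming psutil is installed; log_memory is just a helper name I made up):

import psutil
import torch

def log_memory(step):
    ram_pct = psutil.virtual_memory().percent             # system RAM usage
    gpu_mib = torch.cuda.memory_allocated() / 1024 ** 2   # MiB allocated by PyTorch
    print(f"step {step}: RAM {ram_pct:.1f}% | GPU {gpu_mib:.1f} MiB")

Call log_memory(step) at the top of the training loop and watch whether either number keeps growing across iterations.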
I’m having trouble finding my memory leak, and I’m trying your third tip of moving a continue down line by line and checking. I have a small question about it: if we continue right after the forward call, should the memory consumption stay constant?
The forward call is the first thing in my training loop, and the memory starts to explode. Is this expected, or does it mean the leak is likely inside the call? Thank you.
Yeah, the goal is to isolate each line individually until you find the one causing the memory leak: if the continue above a line shows no issue but moving it below that line leaks, then that line is your problem. If I were to guess, this looks like an autograd memory leak, i.e. PyTorch stores each calculation step so it can compute the gradient of the loss, but if you never actually do the backward/optimizer step, it just keeps accumulating a record of all calculations.
Try wrapping your forward call in a with torch.no_grad(): block to check if that’s the issue, like this:
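(model and inputs here being whatever you use in your loop; this is just for debugging and should be removed once you’ve found the leak):

with torch.no_grad():
    output = model(inputs)  # no computational graph is recorded inside this block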
Thanks for the prompt reply, but when I run with the wrapper torch.no_grad(), this error occurs:
File "main_pred.py", line 145, in <module>
train_res = train_model(train_loader, optim, epoch, args.epochs, writer, model, args, weight_balancing, device)
File "/home/chris/CSD_graph_detection/modules/utils.py", line 321, in train_model
return eval_model(loader, optim, epoch, epochs, writer, model, args, weight_balancing, device, True)
File "/home/chris/CSD_graph_detection/modules/utils.py", line 228, in eval_model
loss.backward()
File "/home/chris/anaconda3/envs/CSD/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/chris/anaconda3/envs/CSD/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Hi, I have the same problem as ii). When I use save_tensor, some layers run forward but are never back-propped, and the memory leaks. I can’t use .detach(). Do you have any way to solve this problem?
Thank you for this thread. I was having issues with my training step because the model would occupy my entire RAM and then just freeze mid-training. After reading your thread and looking carefully at my code, I noticed my custom loss function wasn’t calling .detach() on the tensors it created, and that was what was freezing everything!
This thread was super useful in spotting my memory leak. Based on Charles’s suggestion, I made a class that attempts to spot the position of the memory leak automatically:
import torch


class LeakFinder:

    def __init__(self):
        self.step = 0            # used to keep track of the step within the batch
        self.batch = 0           # used to keep track of the batch
        self.values = {}
        self.predict_every = 20  # how often to predict the leak position
        self.verbose = True      # print the predicted leak position

    def set_batch(self, epoch):
        """
        Set the batch number
        """
        self.batch = epoch
        self.step = 0
        self.values[epoch] = {}

    def get_cuda_perc(self):
        # get the percentage of cuda memory used
        perc = torch.cuda.memory_allocated() / torch.cuda.max_memory_allocated()
        self.values[self.batch][self.step] = perc * 100
        self.step += 1

    def predict_leak_position(self, diffs, per_epoch_remainder):
        # train a tree regressor to predict the per epoch increase
        from sklearn.tree import DecisionTreeRegressor
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import mean_squared_error
        from sklearn.preprocessing import MinMaxScaler

        # insert a zero at the start of per_epoch_remainder
        per_epoch_remainder = torch.cat([torch.tensor([0.0]), per_epoch_remainder])

        # scale the data to be between 0 and 1
        x_scaler = MinMaxScaler()
        diffs = x_scaler.fit_transform(diffs)

        y_scaler = MinMaxScaler()
        per_epoch_remainder = y_scaler.fit_transform(per_epoch_remainder.reshape(-1, 1))

        # train test split
        X_train, X_test, y_train, y_test = train_test_split(
            diffs, per_epoch_remainder, test_size=0.1, random_state=42)

        # train regressor
        regressor = DecisionTreeRegressor(random_state=0)
        regressor.fit(X_train, y_train)

        # predict
        y_pred = regressor.predict(X_test)

        # calculate error
        mse = mean_squared_error(y_test, y_pred)
        mag = mse / per_epoch_remainder.mean() * 100
        print(f"MSE: {mse} ({mag:.2f}%)")

        # find the most important feature (i.e. the step whose per-step diff best
        # explains the per-epoch increase)
        feature_importance = regressor.feature_importances_
        most_important_feature = torch.argmax(torch.tensor(feature_importance)).item()
        print(f"Likely leak position between step {most_important_feature} and step {most_important_feature + 1}")

    def find_leaks(self):
        """
        Find leaks in the training loop
        """
        if self.batch < 2:
            return

        if not self.verbose and self.batch % self.predict_every != 0:
            return

        # estimate per step diff
        diffs = []
        for epoch, values in self.values.items():
            dif = []
            for step in range(1, len(values)):
                dif += [values[step] - values[step - 1]]
            diffs.append(dif)

        lens = [len(x) for x in diffs]
        min_lens = min(lens)

        per_epoch_increase = [self.values[epoch][min_lens - 1] - self.values[epoch][0]
                              for epoch in self.values.keys() if epoch > 0]
        between_epoch_decrease = [self.values[epoch][0] - self.values[epoch - 1][min_lens - 1]
                                  for epoch in self.values.keys() if epoch > 0]

        per_epoch_increase = torch.tensor(per_epoch_increase)
        between_epoch_decrease = torch.tensor(between_epoch_decrease)
        per_epoch_remainder = per_epoch_increase + between_epoch_decrease

        per_epoch_increase_mean = per_epoch_remainder.mean()
        per_epoch_increase_sum = per_epoch_remainder.sum()

        # truncate every row to the same length so the tensor is rectangular
        diffs = torch.tensor([d[:min_lens] for d in diffs])

        print(
            f"Per epoch increase: {per_epoch_increase_mean:.2f}% cuda memory "
            f"(total increase of {per_epoch_increase_sum:.2f}%) currently at "
            f"{self.values[self.batch][min_lens - 1]:.2f}% cuda memory")

        if self.batch % self.predict_every == 0:
            self.predict_leak_position(diffs, per_epoch_remainder)
You can put it in your training loop as such:
leakfinder = LeakFinder()

for batch_idx, batch in enumerate(dataset):
    leakfinder.set_batch(batch_idx)  # pass the batch index, not the batch itself

    # do stuff
    leakfinder.get_cuda_perc()

    # do more stuff
    leakfinder.get_cuda_perc()

    # do even more stuff
    leakfinder.get_cuda_perc()

    # find leaks
    leakfinder.find_leaks()