Advice on debugging a GPU memory leak in graph?


I found an elusive memory leak in my network. Unfortunately I can’t create a minimal working example, but I’ll try to describe the behavior I’m observing below. First, pseudocode of the parts of the model that I think are relevant (let’s say this is in the forward pass of a model, plus a line of loss computation at the end):

logits_a = layer_a(inputs)  # logits_a has size [B, N]
logits_b = layer_b(inputs)  # logits_b has size [B, 1]

combined_logits = torch.cat((logits_a, logits_b), dim=1)  # size [B, N + 1]
combined_probs = torch.softmax(combined_logits, dim=1)
probs_a = combined_probs[:, :-1]  # size [B, N]
probs_b = combined_probs[:, -1]  # size [B]

...  # later on in the code...

# self._weight_a and self._weight_b are parameters of size [1, 1]
h_a = torch.matmul(probs_a.view(B * N, 1), self._weight_a)
h_b = torch.matmul(probs_b.view(B, 1), self._weight_b)  # THIS LINE

...  # later on in the code...

loss = -label * torch.log(probs_b)

What I’m observing after multiple batches of forward passes is that if I comment out the line labeled THIS LINE, memory allocation (via torch.cuda.memory_allocated()) is stable. If I leave that line in (even if I don’t use h_b anywhere further in the graph), the allocated memory increases after each batch, at a point when the tensors in the graph should already have been garbage collected. Note that I am using probs_b later on in the graph, and the last line of the code block doesn’t cause a memory leak. I think the main difference between that line and the line that causes the leak is that there probs_b is not mixed with a parameter in the graph, but I could be wrong.
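For anyone who wants to check for the same symptom, here is a minimal sketch of the kind of per-batch measurement described above. It is not the code from my project: the helper names `allocated_bytes` and `memory_deltas` are made up for illustration, and the sketch assumes each batch can be processed by a single callable.

```python
import torch


def allocated_bytes(device=None):
    """Return bytes currently held by tensors on the CUDA device (0 if CUDA is unavailable)."""
    if not torch.cuda.is_available():
        return 0
    # Wait for pending kernels so the reading reflects completed work.
    torch.cuda.synchronize(device)
    return torch.cuda.memory_allocated(device)


def memory_deltas(step_fn, batches):
    """Call step_fn on each batch and record the change in allocated bytes per batch.

    step_fn is a hypothetical callable standing in for one forward pass; its
    return value is discarded so that its outputs can be freed.
    """
    deltas = []
    previous = allocated_bytes()
    for batch in batches:
        step_fn(batch)
        current = allocated_bytes()
        deltas.append(current - previous)
        previous = current
    return deltas
```

With a stable model the deltas should settle at zero after the first few batches. A delta that stays positive batch after batch is the pattern I am describing.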

Any ideas on what could possibly be going on here, or advice on debugging this? Even when I am not using the result of this matmul, it seems to cause a memory leak. It doesn’t seem to be a problem with probs_a, either.


This only happens when I am using matmul. If I replace matmul with a + or something (e.g., the line below, as long as I use a batch size of 1), there isn’t a memory leak:

h_b = probs_b.view(B, 1) + self._weight_b

The memory leak also happens when I wrap the forward pass in a no_grad and set the model to eval mode. It also happens if I comment out the backward pass and optimizer.step in the training code (so basically, it’s just doing a bunch of forward passes on different data).
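To make that setup concrete, a forward-only loop of this kind could look roughly like the sketch below. The function name `forward_only` is hypothetical and the model is assumed to take a single tensor per batch, so treat it as an illustration rather than my actual training code.

```python
import torch


def forward_only(model, batches):
    """Run forward passes with autograd disabled and the model in eval mode.

    Returns, per batch, whether the output requires grad; every entry should
    be False because no graph is recorded under torch.no_grad().
    """
    model.eval()
    outputs_require_grad = []
    with torch.no_grad():
        for batch in batches:
            out = model(batch)
            outputs_require_grad.append(out.requires_grad)
    return outputs_require_grad
```

Since no backward graph is recorded in this configuration, a leak that survives it is probably not explained by saved activations alone.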

Another weird thing is that gc isn’t reporting the tensors that are adding to memory_allocated. The tensors reported by gc (even the shared_tensors) are stable across batches, whereas the allocated memory is increasing.
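For reference, a gc-based tensor census along these lines can be sketched as follows. The function name `live_tensor_census` is made up for illustration. Note the limitation: it only sees tensors that exist as Python objects tracked by the garbage collector, so memory held purely on the C++ side would not show up here. That is one possible, unconfirmed explanation for the mismatch.

```python
import gc
from collections import Counter

import torch


def live_tensor_census():
    """Count tensors visible to Python's gc, keyed by (device type, shape)."""
    census = Counter()
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                census[(obj.device.type, tuple(obj.shape))] += 1
        except Exception:
            # Some objects raise while being inspected; skip them.
            continue
    return census
```

Taking a census before and after a batch and subtracting the two Counters shows which shapes, if any, are accumulating on the Python side.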

Could you post an executable code snippet so that we could try to reproduce this issue?

Unfortunately I wasn’t able to create a minimal working example. I was able to get around the issue by making a special case when N = 1: in that case I just do elementwise multiplication with * instead of matmul (and rely on broadcasting).
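A small sketch of why the workaround gives the same values: multiplying a [M, 1] column by a [1, 1] weight with matmul is the same as broadcasting the single weight across the column. The function name `scale_by_weight` and the tensor sizes below are assumptions for illustration only.

```python
import torch


def scale_by_weight(probs, weight):
    """Multiply a [M, 1] column by a [1, 1] weight via broadcasting instead of matmul."""
    return probs * weight


torch.manual_seed(0)
probs_b = torch.rand(4, 1)                       # stand-in for probs_b.view(B, 1)
weight_b = torch.nn.Parameter(torch.rand(1, 1))  # stand-in for self._weight_b

via_matmul = torch.matmul(probs_b, weight_b)
via_broadcast = scale_by_weight(probs_b, weight_b)

# Both paths produce a [4, 1] tensor with matching values.
assert via_matmul.shape == via_broadcast.shape == (4, 1)
assert torch.allclose(via_matmul, via_broadcast)
```

This only shows that the two expressions agree numerically. It does not explain why the matmul path leaked in my case.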