GPU memory leak

I run out of GPU memory when training my model. The leak seems to happen at the first call of loss.backward(). My guess is that a copy of the graph somehow remains in memory, but I can’t see where it happens or what to do about it.

Here’s my fit function:

    val_loss_best = np.inf
    losses = []
    # Prepare loss history
    for epoch in range(epochs): 
        for idx_batch, (x, y) in enumerate(dataloader_train):            
            # Propagate input
            mem_gpu = torch.cuda.memory_stats(device='cuda:0')["allocated_bytes.all.current"]
            print(f'Before forward: {mem_gpu:,}')
            netout = net(x.to(device))
            mem_gpu = torch.cuda.memory_stats(device='cuda:0')["allocated_bytes.all.current"]
            print(f"After forward: {mem_gpu:,}")

            # Compute loss on the training set
            loss = loss_function(netout, y.to(device))
            mem_gpu = torch.cuda.memory_stats(device='cuda:0')["allocated_bytes.all.current"]
            print(f'After loss calc: {mem_gpu:,}')
            # Backpropagate loss
            optimizer.zero_grad()
            loss.backward()
            mem_gpu = torch.cuda.memory_stats(device='cuda:0')["allocated_bytes.all.current"]
            print(f'After backward: {mem_gpu:,}')

            # Update weights
            optimizer.step()

        # Compute loss on the validation set
        net.eval()
        mem_gpu = torch.cuda.memory_stats(device='cuda:0')["allocated_bytes.all.current"]
        print(f'After net.eval(): {mem_gpu:,}')
        # mem_gpu = torch.cuda.memory_stats(device='cuda:0')["allocated_bytes.all.current"]
        # print(f'After torch.no_grad():  {mem_gpu:,}')
        val_loss = compute_loss(net, dataloader_val, loss_function, device) #float
        mem_gpu = torch.cuda.memory_stats(device='cuda:0')["allocated_bytes.all.current"]
        print(f'After val loss calc: {mem_gpu:,}')
        if val_loss < val_loss_best:
            val_loss_best = val_loss
    return val_loss_best
def compute_loss(net: torch.nn.Module,
                 dataloader: torch.utils.data.DataLoader,
                 loss_function: torch.nn.Module,
                 device: torch.device = 'cpu') -> float:

    running_loss = 0
    with torch.no_grad():
        for idx_batch, (x, y) in enumerate(dataloader): #iterate across batches
            netout = net(x.to(device))
            current_loss = loss_function(y.to(device), netout).item()
            running_loss += current_loss

    return running_loss / len(dataloader)

and loss_function is nn.MSELoss().
The output of this code is:
Before I call fit():
Before initialising model: 534,819,328
After initialising model: 635,895,808

Now we go into fit():

Before forward: 635,895,808
After forward: 647,901,184
After loss calc: 647,902,208
After backward: 711,877,120

Before forward: 787,703,808
After forward: 799,555,072
After loss calc: 799,555,584
After backward: 863,530,496

Before forward: 787,703,808
After forward: 799,555,072
After loss calc: 799,555,584
After backward: 863,530,496
… (these four lines will repeat until we exit from fit()).
Then, After moving model to cpu: 686,627,328

So I’m losing 150MB of GPU memory. Any ideas, please help!
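The repeated memory_stats lookups in the snippet above could be wrapped in a small helper. This is only a sketch: log_gpu_memory is a hypothetical name, and it relies on torch.cuda.memory_allocated, which reports the same "allocated_bytes.all.current" counter. It returns 0 on a machine without CUDA so that it can be left in place during CPU runs.

```python
import torch


def log_gpu_memory(tag: str, device: str = "cuda:0") -> int:
    """Print and return the bytes currently occupied by tensors on `device`."""
    if not torch.cuda.is_available():
        # Nothing to measure on a CPU-only machine.
        print(f"{tag}: CUDA not available")
        return 0
    allocated = torch.cuda.memory_allocated(device)
    print(f"{tag}: {allocated:,}")
    return allocated


log_gpu_memory("Before forward")
```

Note that this counter covers live tensors only; blocks held by PyTorch's caching allocator are not included.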

Which optimizer are you using? If it’s holding internal states, an increase in memory is expected after the first step() call, and you are not facing a memory leak.

Actually, on a more careful glance, the leak might be happening on the second call (as well?).
I am not sure what you mean by “not facing a memory leak”. I finish my training having 150MB less than when I start it and can’t see what consumes this memory. Since I am doing this training as part of a loop (models change over time) I exhaust my GPU memory pretty quickly.
In other words, no matter whether or not it’s a leak there must be a way to release this memory. I just don’t see how.
What am I missing?

The memory increases as expected and saturates at 863MB, which does not indicate any leaks.

Which is expected, since you are using Adam, which will create internal states for each param, and are not checking the memory after the optimizer.step() call.
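The lazily created Adam state can be made visible with a small experiment. The sketch below uses a hypothetical nn.Linear stand-in (not the model from the question) and runs on the CPU; the same bookkeeping applies on the GPU. It assumes the default Adam settings, where each parameter gets two extra buffers, exp_avg and exp_avg_sq, on the first step() call.

```python
import torch
from torch import nn

net = nn.Linear(1000, 1000)  # hypothetical stand-in model
optimizer = torch.optim.Adam(net.parameters())


def tensor_bytes(t: torch.Tensor) -> int:
    return t.numel() * t.element_size()


param_bytes = sum(tensor_bytes(p) for p in net.parameters())

# No per-parameter state exists before the first step.
states_before = len(optimizer.state)

loss = net(torch.randn(8, 1000)).pow(2).mean()
loss.backward()
optimizer.step()  # the moment estimates are allocated here

# Sum the two moment buffers that Adam now holds for every parameter.
state_bytes = sum(
    tensor_bytes(state[key])
    for state in optimizer.state.values()
    for key in ("exp_avg", "exp_avg_sq")
)

print(f"parameters:  {param_bytes:,} bytes")
print(f"state before first step: {states_before} entries")
print(f"Adam buffers after first step: {state_bytes:,} bytes")
```

With these defaults the two buffers together take twice the memory of the parameters, and that memory stays allocated for as long as the optimizer object is alive.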

I don’t understand how this can happen, if the memory usage is static based on your comment:

No, since Adam requires this memory. You would need to delete the optimizer to be able to release the memory.
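When many models are trained one after another, one way to apply this is to keep the optimizer local to each training run and drop it when the run ends. The following is a minimal sketch under assumptions: train_one_model, the nn.Linear model, and the random data are all hypothetical placeholders for the per-window training in the question.

```python
import gc

import torch
from torch import nn


def train_one_model(x: torch.Tensor, y: torch.Tensor, steps: int = 3) -> float:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    net = nn.Linear(x.shape[1], 1).to(device)  # hypothetical model
    optimizer = torch.optim.Adam(net.parameters())
    loss_function = nn.MSELoss()
    x, y = x.to(device), y.to(device)

    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_function(net(x), y)
        loss.backward()
        optimizer.step()

    final_loss = loss.item()  # plain Python float, holds no graph

    # Locals are released on return anyway; the explicit del matters
    # when this code lives directly in a loop body instead of a function.
    del optimizer, net, loss
    gc.collect()
    if torch.cuda.is_available():
        # Hand cached blocks back to the driver (affects nvidia-smi,
        # not the allocated_bytes counter).
        torch.cuda.empty_cache()
    return final_loss


results = [
    train_one_model(torch.randn(32, 8), torch.randn(32, 1)) for _ in range(3)
]
print(results)
```

Returning loss.item() rather than the loss tensor keeps the caller from holding on to the autograd graph of the last batch.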

The memory usage is static in my comment because it’s just one training of the model (at one point in time). These numbers are produced within the loop

for epoch in range(epochs):
    for idx_batch, (x, y) in enumerate(dataloader_train):

in my code. Then I move to the next data point (the next point in time, which means a new window and new dataloader_train and dataloader_val) and train another model, which consumes another 150MB.
It had not occurred to me that the problem might be caused by an optimizer.

del optimizer

seems to do the trick, but about 20MB is still left after the first time step, and a couple of MB are added at every ensuing time step. What can it be, and what can I do about it?

I have identified the problem. It turns out that in the forward pass I was assigning a tensor to an attribute of the module (so the module kept a reference to it after the forward pass had finished), something like:

self._ten = torch.bmm(...)

It was enough to change it to:

ten = torch.bmm(...)
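The difference between the two variants can be reproduced in isolation. In this sketch the Leaky and Fixed modules, their weight shape, and the input are hypothetical; the point is only that a tensor stored on self outlives the forward call together with its autograd graph, while a local variable does not.

```python
import torch
from torch import nn


class Leaky(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(4, 8, 16))

    def forward(self, x):
        # Stored on the module: the tensor and the graph behind it
        # stay referenced after forward() returns.
        self._ten = torch.bmm(x, self.weight)
        return self._ten.sum()


class Fixed(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(4, 8, 16))

    def forward(self, x):
        # Local variable: nothing on the module refers to it afterwards.
        ten = torch.bmm(x, self.weight)
        return ten.sum()


x = torch.randn(4, 16, 8)
leaky, fixed = Leaky(), Fixed()
leaky(x)
fixed(x)

leaky_keeps_graph = hasattr(leaky, "_ten") and leaky._ten.grad_fn is not None
fixed_keeps_tensor = hasattr(fixed, "_ten")
print(leaky_keeps_graph, fixed_keeps_tensor)
```

If the intermediate value is needed later for inspection, storing ten.detach() instead would at least drop the graph, although the tensor itself would still occupy memory.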