PyTorch model loading with memory leakage on CPU

I have a PyTorch model deployed in production. The model gets retrained every day, and the new model replaces the old one. During this process, I see the memory usage increase monotonically until it saturates. I reproduced my observation with the following code. Is there any way to avoid this memory leak?

import gc
from torchvision.models import resnet50, ResNet50_Weights
import resource
import matplotlib.pyplot as plt

def mem_usage():
    memory_usage_rss_self = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    memory_usage_rss_children = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss /1024
    return memory_usage_rss_self + memory_usage_rss_children

memory = []

for i in range(300):
    model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval().cpu()
    del model
    gc.collect()
    memory.append(mem_usage())

plt.plot(memory)
plt.ylabel('memory')
plt.xlabel('time')
plt.savefig('memory.pdf', bbox_inches='tight', pad_inches=0)
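
(For completeness: one pattern that avoids re-instantiating the architecture on every reload is to keep a single model instance and load the retrained weights into it with load_state_dict. A minimal sketch is below; the checkpoint path is hypothetical.)

import torch
from torchvision.models import resnet50, ResNet50_Weights

# Build the architecture once and keep reusing the same instance.
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval().cpu()

def reload_weights(model, checkpoint_path):
    # Load the retrained weights onto the CPU and copy them into the
    # existing parameters instead of constructing a new model object.
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state_dict)
    return model

# Hypothetical daily update:
# reload_weights(model, "retrained_resnet50.pth")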

Hi Amir,

I have run your code in a Colab notebook and noticed the same thing. However, this looks like data being kept in memory rather than a CPU memory leak. In other words, you might have variables (lists, objects, etc.) that keep growing, which increases memory usage over time.

Here is the code to reproduce my tests:
memory_usage_overtime.py

import gc
import torch
from torchvision.models import resnet50, ResNet50_Weights
import resource
import matplotlib.pyplot as plt

def mem_usage():
    memory_usage_rss_self = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    memory_usage_rss_children = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss /1024
    return memory_usage_rss_self + memory_usage_rss_children

memory = []

for i in range(300):
    gc.collect()
    memory.append(mem_usage())
    plt.clf()
    plt.plot(memory)
    plt.ylabel('memory')
    plt.xlabel('time')
    plt.show()

Here is the code to measure GPU memory, which shows that GPU usage isn’t increasing over time:
memory_usage_gpu.py

import subprocess as sp

import torch
import matplotlib.pyplot as plt
from torchvision.models import resnet50, ResNet50_Weights

def get_gpu_memory():
    # Parse `nvidia-smi` output into a list of per-GPU used-memory values (MiB).
    output_to_list = lambda x: x.decode('ascii').split('\n')[:-1]
    COMMAND = "nvidia-smi --query-gpu=memory.used --format=csv"
    try:
        memory_use_info = output_to_list(sp.check_output(COMMAND.split(), stderr=sp.STDOUT))[1:]
    except sp.CalledProcessError as e:
        raise RuntimeError("command '{}' returned with error (code {}): {}".format(e.cmd, e.returncode, e.output))
    memory_use_values = [int(x.split()[0]) for x in memory_use_info]
    return memory_use_values

memory = []

for _ in range(50):
    torch.cuda.empty_cache()
    model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval().to("cuda")
    del model
    memory.append(get_gpu_memory()[0])
    plt.clf()
    plt.plot(memory)
    plt.ylabel('memory')
    plt.xlabel('time')
    plt.show()

Thanks @sudomaze. I ran your code for the GPU, and the memory does not change. I also ran your code for the CPU (without model creation, only recording the memory usage) and I do not see an increase in memory usage either. It is very interesting that the behavior is different. (I just made a quick change that puts the plotting commands outside the for loop; a sketch of it is below.)
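
For reference, the modified loop looks roughly like this (same measurement code as above, with plotting moved after the loop):

import gc
import resource
import matplotlib.pyplot as plt

def mem_usage():
    memory_usage_rss_self = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    memory_usage_rss_children = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1024
    return memory_usage_rss_self + memory_usage_rss_children

memory = []

# Collect the readings first, then plot once after the loop finishes.
for i in range(300):
    gc.collect()
    memory.append(mem_usage())

plt.plot(memory)
plt.ylabel('memory')
plt.xlabel('time')
plt.show()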

The first script should show the memory increasing, which is an indication that the growth comes from the memory list itself being updated.

You might want to make sure that none of the variables in your environment count toward the memory usage measurement, because updating a list/object/etc. will itself increase memory usage.
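
One way to rule the list itself out is to stream each reading to a file instead of accumulating it in memory, so the measurement loop allocates almost nothing. A minimal sketch (the log file name is arbitrary):

import gc
import resource

def mem_usage():
    usage_self = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    usage_children = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1024
    return usage_self + usage_children

# Write each reading to disk so the loop keeps no growing in-memory state.
with open('memory_log.txt', 'w') as log:
    for i in range(300):
        gc.collect()
        log.write('{}\n'.format(mem_usage()))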

Hi @ptrblck. I observe the memory increase due to PyTorch model loading on the CPU. I wrote a small script that reproduces my problem. Do you have any thoughts on how I can resolve this issue? Thanks in advance.

Sorry, I don’t fully understand the use case, as I’m only seeing an increase of a few kB in each iteration, which might correspond to the memory values stored in the list.
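
If it helps, tracemalloc from the standard library can show where Python-side allocations grow across iterations (note that it only sees allocations made through Python’s allocator, not native allocations inside PyTorch). A minimal sketch of comparing snapshots:

import gc
import tracemalloc
from torchvision.models import resnet50, ResNet50_Weights

tracemalloc.start()
before = tracemalloc.take_snapshot()

for i in range(10):
    model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval().cpu()
    del model
    gc.collect()

after = tracemalloc.take_snapshot()

# Print the source lines whose Python allocations grew the most.
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)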

Thanks @ptrblck and @sudomaze for your responses.