Releasing memory after running a PyTorch model inference

I have a small dummy feedforward network defined in PyTorch, on which I run inference as follows:

import torch
import torch.nn as nn

device = torch.device("cpu")

n_input, n_hidden, n_out = 100, 150, 1
batch_size = 5000

data_x = torch.randn(batch_size, n_input)
data_x = data_x.to(device)

def create_model():
    hidden_layers = [nn.Linear(n_hidden, n_hidden), nn.ReLU()]
    model = nn.Sequential(*([
        nn.Linear(n_input, n_hidden), nn.ReLU()] +
        hidden_layers * 20 +
        [nn.Linear(n_hidden, n_out), nn.Sigmoid()]))
    return model

def make_prediction(model, data_x):
    return model(data_x)

@profile  # provided by memory_profiler when run via python -m memory_profiler
def main():
    model = create_model()

    y_pred = make_prediction(model, torch.randn(batch_size, n_input))
    y_pred = make_prediction(model, torch.randn(batch_size, n_input))
    y_pred = torch.rand((batch_size, n_out))

if __name__ == "__main__":
    main()


I am interested in knowing how much memory each operation takes up. Using memory-profiler (run via python -m memory_profiler), I profile the code and get this result:


Line #    Mem usage    Increment  Occurrences   Line Contents
    26  270.777 MiB  270.777 MiB           1   @profile
    27                                         def main():
    28  271.773 MiB    0.996 MiB           1       model = create_model()
    30  396.586 MiB  124.812 MiB           1       y_pred = make_prediction(model, torch.randn(batch_size, n_input))
    31  513.059 MiB  116.473 MiB           1       y_pred = make_prediction(model, torch.randn(batch_size, n_input))
    32  513.059 MiB    0.000 MiB           1       y_pred = torch.rand((batch_size, n_out))

Could someone please explain this result? Specifically, I do not understand why the total memory usage keeps accumulating after every inference call. Once inference has run and y_pred has been computed, why does torch still keep using around 120 MB?

To check whether y_pred itself is taking up that much memory, I create another random y_pred-like tensor at the end, and as you can see, it uses almost no memory. All of the ~120 MB is used for the intermediate computations when running data_x through the network, and yet that memory is not released once the computations are completed and y_pred is calculated.
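As a rough sanity check on these numbers, the tensor sizes can be estimated by hand. This is only a sketch under two assumptions that the profiler does not confirm: tensors are float32 (PyTorch's default dtype), and roughly one hidden-sized output is kept per Linear and per ReLU layer.

```python
# Back-of-the-envelope tensor sizes for the network above.
# Assumption: float32 (4 bytes per element), PyTorch's default dtype.
n_hidden, n_out = 150, 1
batch_size = 5000
BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2

y_pred_bytes = batch_size * n_out * BYTES_PER_FLOAT32
activation_bytes = batch_size * n_hidden * BYTES_PER_FLOAT32

# Assumption: one hidden-sized output per Linear and per ReLU,
# for the input layer plus the 20 repeated hidden blocks.
n_hidden_activations = 2 * (1 + 20)
total_bytes = n_hidden_activations * activation_bytes

print(f"y_pred:          {y_pred_bytes / MIB:.3f} MiB")
print(f"one activation:  {activation_bytes / MIB:.3f} MiB")
print(f"all activations: {total_bytes / MIB:.1f} MiB")
```

Under these assumptions the intermediates come to about 120 MiB, close to the per-call increment in the profile, while y_pred itself is only about 0.02 MiB.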

Am I understanding this wrong, or does memory-profiler not work with torch? To make sure this is not some GPU-related issue that memory-profiler cannot track, I am forcing everything to happen on the CPU.

Any help is appreciated. Thanks!

EDIT: I tried @torch.no_grad() and torch.cuda.empty_cache() and that does not fix this issue.


This is most likely due to autograd tracking the computation graph for each of your calls to make_prediction separately. I think using torch.no_grad should fix this issue.


# no gradients / computation graph will be tracked, saving memory
@torch.no_grad()
def make_prediction(model, data_x):
    return model(data_x)
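To see what torch.no_grad changes, one can check whether the output still carries autograd history. A minimal sketch, assuming a small stand-in model; tiny_model, predict_tracked and predict_untracked are hypothetical names, not part of the original code:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model, much smaller than the one in the question.
tiny_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
x = torch.randn(16, 4)

def predict_tracked(model, data_x):
    # Default mode: autograd records the graph behind the output.
    return model(data_x)

@torch.no_grad()
def predict_untracked(model, data_x):
    # Gradient tracking is disabled for the duration of the call.
    return model(data_x)

y_tracked = predict_tracked(tiny_model, x)
y_untracked = predict_untracked(tiny_model, x)

# The tracked output references its graph through grad_fn.
print(y_tracked.requires_grad, y_tracked.grad_fn is not None)   # True True
# The untracked output has no graph attached.
print(y_untracked.requires_grad, y_untracked.grad_fn is None)   # False True
```

As long as a tracked output is alive, the graph it references (and the intermediates saved for the backward pass) stays alive too, which is why the decorator lowers the memory held by each call.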

Thank you for the response. I should have included using torch.no_grad and torch.cuda.empty_cache() in the original question. But that does not actually solve this problem.
Adding @torch.no_grad() on top of the function does reduce the peak memory of the call considerably: instead of 124 MB, it takes around 30 MB. That’s really good, but my problem is that this 30 MB does not seem to be released after the computation. My actual problem is with another, much bigger model whose peak memory is much higher (even with no_grad()), and if that memory is not released, it breaks downstream tasks.

You could try calling gc.collect() after making predictions. I’m unsure how one would go about using gc.collect() and torch.cuda.empty_cache() in combination, though. It could be that the call order makes a difference.
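One possible ordering, offered as a sketch rather than a verified fix: run gc.collect() first and call torch.cuda.empty_cache() afterwards, since empty_cache() can only release cached blocks that no live tensor still occupies. The helper name predict_and_release and the small stand-in model are hypothetical.

```python
import gc

import torch
import torch.nn as nn

def predict_and_release(model, data_x):
    # Hypothetical helper: run inference without autograd, then clean up.
    with torch.no_grad():
        y_pred = model(data_x)
    # 1) Collect unreachable Python objects (and any tensors they hold).
    gc.collect()
    # 2) Then ask the CUDA caching allocator to release unused blocks.
    #    Only relevant for GPU tensors, so it is skipped on CPU-only setups.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return y_pred

# Small stand-in model, not the one from the question.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
out = predict_and_release(model, torch.randn(16, 4))
print(out.shape)
```

Even with this cleanup, the process memory reported by memory-profiler may not drop, because the underlying allocator may keep freed memory around for reuse instead of returning it to the operating system.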