Freeing memory on each step?

I have a model that operates on variable-length inputs. At inference time, I need to split the data into fixed-size chunks, otherwise I get a CUDA out-of-memory error. I set the chunk size as large as possible, so as to fill my GPU memory. The code goes like this:

    # Forward chunks
    chunks = extract_chunks(input)    # Returns numpy arrays
    for chunk in chunks:
        x, y_true = collate([chunk])  # Collate a single chunk into a batch, returns CPU tensors
        x, y_true = x.cuda(), y_true.cuda()
        y_pred = model(x)

I monitor my GPU memory usage with nvidia-smi while running this code.

I noticed that if I add the line

        # Free the memory
        del x, y_pred, y_true

at the end of the loop, the memory usage is halved, and I can effectively double my chunk size. There seems to be no overhead for doing this either. What is going on here?
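One way to check this observation from inside PyTorch rather than with nvidia-smi is to compare the peak allocated memory with and without the `del`. The sketch below is only an illustration: the model, chunk shapes and helper names (`run`, `peak_mib`) are made up, and it reports `None` when no GPU is available. A likely explanation for the halving is that, without `del`, the names `x` and `y_pred` still point at the previous iteration's tensors while the next forward pass runs, and with gradients enabled `y_pred` also keeps the saved activations of its graph alive.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-ins for the model and chunks in the question.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256)).to(device)
chunks = [torch.randn(64, 256) for _ in range(4)]

def run(free_each_step):
    # Same loop as in the question, with gradients left enabled on purpose.
    for chunk in chunks:
        x = chunk.to(device)
        y_pred = model(x)
        if free_each_step:
            # Drop the references before the next forward pass starts.
            del x, y_pred

def peak_mib(free_each_step):
    # Peak memory occupied by tensors during run(), in MiB (None without CUDA).
    if device != "cuda":
        run(free_each_step)
        return None
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    run(free_each_step)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

peak_keep = peak_mib(free_each_step=False)
peak_free = peak_mib(free_each_step=True)
print("peak without del:", peak_keep)
print("peak with del:   ", peak_free)
```

Note that nvidia-smi shows the memory reserved by PyTorch's caching allocator, which can stay high after tensors are freed, so `torch.cuda.max_memory_allocated()` is usually the more direct number to compare.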

I realized that I had forgotten to wrap my loop in torch.no_grad(). With it in place, the behaviour no longer occurs. Does that mean that the gradients were not freed until I called del?
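For reference, a minimal sketch of the loop wrapped in `torch.no_grad()`. The model and chunks here are hypothetical placeholders, not the code from the question. Under `no_grad` no autograd graph is recorded, so each output holds only its own data instead of also keeping intermediate activations alive for a backward pass.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-ins for the model and chunks in the question.
model = nn.Linear(16, 4).to(device).eval()
chunks = [torch.randn(8, 16) for _ in range(3)]

# Outside no_grad, the output carries a grad_fn, i.e. a reference to its graph.
y_with_graph = model(chunks[0].to(device))
print(y_with_graph.grad_fn is not None)  # True

preds = []
with torch.no_grad():
    for chunk in chunks:
        x = chunk.to(device)
        y_pred = model(x)          # no graph is recorded here
        preds.append(y_pred.cpu()) # keep results on the CPU to free GPU memory

print(preds[0].requires_grad)  # False
```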

Did you store y_pred or any other tensor attached to the computation graph somehow?
Usually you see the memory growing if you store a tensor with the computation graph, e.g.

    output = model(data)
    loss = criterion(output, target)
    losses.append(loss)   # will store the whole computation graph -> increased memory usage
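If the goal is only to log the value, a common way to avoid this is to store a Python number or a detached tensor instead. The sketch below uses a made-up model, criterion and data purely for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical model, criterion and data, only to make the sketch runnable.
model = nn.Linear(16, 1)
criterion = nn.MSELoss()
data, target = torch.randn(8, 16), torch.randn(8, 1)

losses = []
for _ in range(3):
    output = model(data)
    loss = criterion(output, target)
    # .item() copies the scalar into a plain Python float,
    # so the list holds no reference to the computation graph.
    losses.append(loss.item())

# If a tensor is needed instead of a float, detach() returns one
# that is cut off from the graph.
kept = loss.detach()
print(type(losses[0]), kept.requires_grad)
```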

I haven’t. There isn’t much more to the code than what I wrote.