Freeing CUDA memory after forwarding tensors

Consider the following loop:

    for batch in dataloader:
        batch = batch.cuda()
        features = model(batch)

If forwarding a batch takes up almost all the memory on my GPU (say 7 GB out of 8), then this loop will fail on the second iteration with an OOM error.

This version won’t:

    for batch in dataloader:
        batch = batch.cuda()
        features = model(batch)
        del features

Even though features is only a very small tensor (a dozen values or so). What exactly does del do here? Why do I need to free the memory manually to achieve what I want? Is this good practice? If not, what is an alternative that doesn’t require halving the batch size?

The whole computation graph is attached to features, so deleting features also frees the graph, assuming you didn’t wrap the block in a torch.no_grad() guard.
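For reference, a minimal sketch of the inference loop with the guard in place, using the same model and dataloader as in your snippet:

    import torch

    # No autograd graph is recorded inside this context, so each
    # iteration's activations are freed as soon as `features` is rebound.
    with torch.no_grad():
        for batch in dataloader:
            batch = batch.cuda()
            features = model(batch)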
However, in a training loop the second iteration shouldn’t cause an OOM issue, since the graph’s intermediate buffers are freed once loss.backward() is called.

If training runs fine but you run out of memory in the first evaluation iteration, you might be keeping unnecessary variables alive due to Python’s function scoping, as explained in this post.
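One way to avoid that is to wrap the training loop in a function, so its locals go out of scope before evaluation starts. A rough sketch, assuming a standard criterion and optimizer and (input, target) batches; the names are illustrative:

    def train_one_epoch(model, dataloader, criterion, optimizer):
        for batch, target in dataloader:
            batch, target = batch.cuda(), target.cuda()
            optimizer.zero_grad()
            output = model(batch)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
        # `output` and `loss` are locals: once the function returns, the
        # last iteration's tensors (and their graph) become collectable.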

Indeed, I understand now. The features (and graph) from the previous iteration remain in memory while model(batch) is evaluated for the second time, so there are points in the execution where two different graphs exist in memory at once.

You’re right, in inference mode I should be wrapping the call with no_grad.

Would this not happen in a training loop, as the additional backward pass would free the graph?

In a training loop you would usually reassign the output to the same variable, thus dropping the reference to the old output and storing the current one.

If you are using different variables for the output, losses, etc. in the training and validation loops, you would waste a bit of memory, which could be critical if you are already using almost the whole GPU memory.
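A sketch of what reusing the same names looks like across both phases; criterion, optimizer, and val_loader are assumed here, and the names are illustrative:

    # Training: rebinding `output`/`loss` each iteration drops the
    # previous iteration's tensors together with their graph.
    model.train()
    for batch, target in dataloader:
        batch, target = batch.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(batch)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

    # Validation: reusing the same names means the last training
    # `output`/`loss` don't linger alongside the validation tensors.
    model.eval()
    with torch.no_grad():
        for batch, target in val_loader:
            batch, target = batch.cuda(), target.cuda()
            output = model(batch)
            loss = criterion(output, target)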
