GPU memory not fully released after training loop

My training + cross-validation loop usually looks something like:

for epoch in range(max_epoch):

    mean_err = 0
    for train_idx, train_data in enumerate(train_dataloader):
        # wrap train_data in Variables; move to GPU
        # zero gradients
        # do forward/backward propagation stuff
        # update mean_err
    print('Mean Training Error:', mean_err)
    mean_err = 0
    for cv_idx, cv_data in enumerate(cv_dataloader):
        # wrap cv_data in Variables; move to GPU
        # do forward propagation stuff
        # update mean_err
    print('Mean Cross-Validation Error:', mean_err)

What I often find is that there is a particular batch size at which the training loop runs just fine, but then the GPU immediately runs out of memory and crashes in the cross-validation loop. If I step down the batch size, the system becomes stable and will run indefinitely through many epochs. What’s going on? The inference-only cross-validation should be less memory-intensive than the training, so evidently not all the memory used during the training loop is being freed. How do I fix this so that I can run larger batches?


I see the same problem. Training works fine, but as soon as I run the validation data, the memory on the card increases until I get a CUDA out-of-memory error. Very interested in a solution to this.

FWIW, my model is an RNN with LSTM units. Not sure if that matters.

My guess is that in the training loop you assign various variables, likely something like prediction = net(train_data). This prediction holds pointers to the entire computation graph the data went through (so probably a lot of memory). Once the training loop finishes, the last assignment to prediction is still kept, since in Python loops and ifs don’t create new scopes; all those variables live in the function scope (or global scope). So there’s no way for PyTorch to know that that whole graph can now be reclaimed.

Either explicitly del the variables after the training loop, or put the training loop into a separate function.
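A toy, pure-Python illustration of the scoping point (no PyTorch needed; `prediction` here is just a stand-in for `net(train_data)`):

```python
for i in range(3):
    prediction = object()  # stand-in for net(train_data), which would hold the graph

# A for-loop does not create a new scope, so the last `prediction`
# (and everything it references) is still alive after the loop.
still_alive = 'prediction' in dir()

del prediction  # explicitly drop the reference so it can be reclaimed
gone = 'prediction' not in dir()
```

Putting the training loop in its own function achieves the same thing, since the function’s locals disappear when it returns.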

By the way, also make sure you’re not doing something like mean_err += err where err is a Variable, because that will also build a large, unnecessary graph. Use mean_err += err.data[0] instead.
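To see why accumulating a Variable grows memory, here is a toy stand-in (not real autograd): each addition records its operands, the way an autograd op records its inputs, so repeated `mean_err += err` keeps the entire history alive:

```python
class Var:
    """Toy stand-in for an autograd Variable: adding two Vars
    records both operands, so chained += keeps every prior node alive."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents

    def __add__(self, other):
        return Var(self.value + other.value, parents=(self, other))

def graph_size(v):
    """Count every node reachable from v."""
    return 1 + sum(graph_size(p) for p in v.parents)

mean_err = Var(0.0)
for step in range(100):
    err = Var(1.0)              # stand-in for a per-batch loss Variable
    mean_err = mean_err + err   # graph grows by two nodes per step

big = graph_size(mean_err)      # 201 nodes after 100 steps

# Accumulating the plain Python number keeps nothing extra around:
total = 0.0
for step in range(100):
    err = Var(1.0)
    total += err.value          # analogue of err.data[0]
```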


I am already using the .data[0], but that is good to know. del of the variables does not seem to free up the CUDA memory at all. Is there something I need to do for CUDA variables after I call del?

For my use case, the difference between train and validate is the calls to:


Could those be freeing up the memory in some way? My train loop has far more batches and has no issues with GPU memory.

I don’t think there’s anything special you need to do with CUDA variables; I haven’t needed to in the past.
Just to be sure, when you say it doesn’t free up memory, do you mean you still crash, or are you going by a profiler of some kind? As far as I know, the memory will still be claimed by PyTorch for later use, so it looks used from any profiler’s perspective.

To check whether it’s memory left unfreed from training or actually something wrong with your validation loop, you can try commenting out the training loop entirely and running just validation with the larger batch size.
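As a CPU-side analogy of that experiment, `tracemalloc` shows the same pattern: memory referenced by the loop’s last assignment counts as “used” until it is deleted (for the GPU you’d watch nvidia-smi instead):

```python
import tracemalloc

tracemalloc.start()

# Stand-in "training loop": each iteration allocates a big buffer.
for i in range(3):
    activations = bytearray(10**6)

current, _ = tracemalloc.get_traced_memory()
before_del = current            # ~1 MB: the last buffer is still referenced

del activations
after_del, _ = tracemalloc.get_traced_memory()  # drops back near zero
```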

My nvidia-smi goes to the max memory on the card and the loop will crash with:

THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/generic/ line=66 error=2 : out of memory

Are you deleting the direct output of the net and any other variables calculated from that output, say your loss?

Per your earlier point, the graph was building on itself and continuing to increase in size. That was a subtle detail for me as a newbie to PyTorch.

I am using an RNN with LSTM units, so I have to detach the hidden states at each pass; otherwise the graph keeps growing.

From an example, I had already gotten:
def repackage_hidden(h):
    """Wraps hidden states in new Variables, to detach them from their history."""
    if type(h) == Variable:
        return Variable(h.data)
    else:
        return tuple(repackage_hidden(v) for v in h)
but I was not calling that on my validation loop. Adding that in, I can proceed and validation works with no trouble.
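To make the recursion concrete, here is the same function run against a toy `Variable` class (just a stand-in; real LSTM hidden states are a tuple of (h, c) tensors). The detached copies keep the data but drop the link back into the graph:

```python
class Variable:
    """Toy stand-in: `creator` plays the role of the link into the autograd graph."""
    def __init__(self, data, creator=None):
        self.data = data
        self.creator = creator

def repackage_hidden(h):
    """Wraps hidden states in new Variables, to detach them from their history."""
    if type(h) == Variable:
        return Variable(h.data)
    else:
        return tuple(repackage_hidden(v) for v in h)

# An LSTM hidden state is a (h, c) tuple; pretend both came out of a graph.
hidden = (Variable([0.1], creator='graph'), Variable([0.2], creator='graph'))
detached = repackage_hidden(hidden)
```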

After debugging for hours I found this line of code solved my problem:

with torch.no_grad():
    valid()  # function of your validation loop

Strongly suggest this should be put into official tutorials and examples.


Could you write an example? I have met the same problem and I can’t understand your solution.
Thanks a lot!

for epoch in range(num_epochs):
    train()  # function contains the training loop.
    with torch.no_grad():
        valid()  # function contains the validation loop.

I think I’ve made it more clear this time.


I also had the same problem in my code. Luckily, I found a method to solve it:
you can set torch.backends.cudnn.benchmark = False. With benchmarking enabled, cuDNN tries several algorithms for each new input shape and caches their workspaces, which can keep growing when input sizes vary between training and validation.