GPU memory not fully released after training loop

My training + cross-validation loop usually looks something like:

for epoch in range(max_epoch):
    print('Epoch:', epoch)

    # TRAINING
    mean_err = 0
    net.train()
    for train_idx, train_data in enumerate(train_dataloader):
        # wrap train_data in Variables; move to GPU
        # zero gradients
        # do forward/backward propagation stuff
        # update mean_err
    print('Mean Training Error:', mean_err)
    
    # CROSS-VALIDATION
    mean_err = 0
    net.eval()
    for cv_idx, cv_data in enumerate(cv_dataloader):
        # wrap cv_data in Variables; move to GPU
        # do forward propagation stuff
        # update mean_err
    print('Mean Cross-Validation Error:', mean_err)

What I often find is that there is a particular batch size at which the training loop runs just fine, but the GPU then immediately runs out of memory and crashes in the cross-validation loop. If I step the batch size down, the system becomes stable and will run indefinitely through many epochs. What’s going on? The inference-only cross-validation should be less memory intensive than training, so clearly not all of the memory used during the training loop is being freed. How do I fix this so that I can run larger batches?

I see the same problem. Training works fine, but as soon as I try to run the validation data, the memory on the card keeps increasing until I get a CUDA out of memory error. Very interested in a solution to this.

FWIW, my model is an RNN with LSTM units. Not sure if that matters.

My guess is that in the training loop you assign various variables, likely something like prediction = net(train_data). That prediction holds references to the entire computation graph the data went through (so probably a lot of memory). Once the training loop finishes, the last assignment to prediction is still kept alive, since in Python loops and ifs don’t create new scopes; all those variables live in the function scope (or the global scope), so there’s no way to know that the whole graph can now be reclaimed.

Either explicitly del the variables after the training loop, or put the training loop into a separate function.
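
Something like this (a rough sketch; criterion, inputs, and targets here are just stand-ins for whatever your own loop uses):

# Option 1: explicitly drop the references left over from the last iteration
for train_idx, train_data in enumerate(train_dataloader):
    inputs, targets = train_data       # assume already wrapped / moved to the GPU
    opt.zero_grad()
    prediction = net(inputs)
    loss = criterion(prediction, targets)
    loss.backward()
    opt.step()
del prediction, loss                   # the last iteration's graph can now be reclaimed

# Option 2: put the loop in a function so everything goes out of scope
def train_one_epoch():
    for train_idx, train_data in enumerate(train_dataloader):
        ...                            # same body as above
    # prediction and loss are locals here, so they are freed when the function returns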

By the way, also make sure you’re not doing something like mean_err += err where err is a Variable, because that will also build a large, unnecessary graph. Use err.data instead.
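
For example (on PyTorch 0.4+, err.item() does the same thing as err.data[0]):

mean_err += err          # bad: err is a Variable, so this keeps every batch's graph alive
mean_err += err.data[0]  # good: accumulates a plain Python number
mean_err += err.item()   # equivalent on PyTorch >= 0.4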

I am already using err.data[0], but that is good to know. Calling del on the variables does not seem to free up the CUDA memory at all. Is there something I need to do for CUDA variables after I call del?

For my use case, the difference between train and validate is the calls to:

opt.zero_grad()
...
loss.backward() 
opt.step()

Could those be freeing up the memory in some way? My training loop has far more batches and has no GPU memory issues.

I don’t think there’s anything special you need to do with CUDA variables; I haven’t had to in the past.
Just to be sure, when you say it doesn’t free up the memory, do you mean you still crash, or are you using a profiler of some kind? As far as I know, the memory will still be claimed by PyTorch for later reuse, so it will still show as used from any profiler’s perspective.

To check whether it’s memory left unfreed from training or actually something wrong with your validation loop, you can try commenting out the training loop entirely and using the larger batch size.
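
You can also compare what PyTorch thinks is actually in use with what it is merely caching (a rough sketch; on older versions torch.cuda.memory_reserved() is called torch.cuda.memory_cached()):

import torch

# memory occupied by live tensors
print('allocated: %.1f MiB' % (torch.cuda.memory_allocated() / 1024**2))
# memory held by the caching allocator -- roughly what nvidia-smi reports for the process
print('reserved:  %.1f MiB' % (torch.cuda.memory_reserved() / 1024**2))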

nvidia-smi shows the memory usage climbing to the card’s maximum, and then the loop crashes with:

THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory

Are you deleting the direct output of the net and any other variables calculated from that output, say your loss?

Per your earlier point, the graph was building on itself and continuing to increase in size. It was a subtle detail for me as a newbie to PyTorch.

I am using an RNN with LSTM units. I have to detach the hidden states on each pass, otherwise the graph keeps growing.

From an example, I had already gotten:
def repackage_hidden(h):
    """Wraps hidden states in new Variables, to detach them from their history."""
    if type(h) == Variable:
        return Variable(h.data)
    else:
        return tuple(repackage_hidden(v) for v in h)
but I was not calling it in my validation loop. After adding that in, validation runs with no trouble.
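
In case it helps anyone else, this is roughly what my validation loop looks like now (init_hidden and the net(cv_data, hidden) call are specific to my model, so treat this as a sketch):

net.eval()
hidden = net.init_hidden(batch_size)        # however your model builds its initial state
for cv_idx, cv_data in enumerate(cv_dataloader):
    hidden = repackage_hidden(hidden)       # detach from the previous batch's graph
    output, hidden = net(cv_data, hidden)
    # ... compute the validation error from output as before ...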

After debugging for hours, I found that this change solved my problem:

with torch.no_grad():
    valid()  # function containing your validation loop

Strongly suggest this should be put into official tutorials and examples.

@JiamingSun
Could you write an example? I’ve run into the same problem and I can’t understand your solution.
Thanks a lot!

for epoch in range(num_epochs):
    train()  # function contains the training loop.
    with torch.no_grad():
        valid()  # function contains the validation loop.

I think I’ve made it clearer this time.

I also had the same problem in my code. Luckily, I found a way to solve it: you can set torch.backends.cudnn.benchmark = False.
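
If it helps, I just set it once at the top of my script, before building the model. My understanding (treat this as a guess) is that with benchmark mode on, cuDNN tries out and caches algorithms per input shape, which can take extra workspace memory when the validation batches have a different shape from the training batches:

import torch

# disable cuDNN benchmark mode (per-input-shape autotuning of algorithms)
torch.backends.cudnn.benchmark = False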