Ah I see. I'm trying to get model parallelism and data parallelism to work so that I can (hopefully) use multiple GPUs.
Can you please clarify about restarting? Sometimes I fill up the CUDA memory with settings that are too much for it, and I want to adjust the settings and try again. Is running torch.cuda.empty_cache()
equivalent to rebooting the machine? Or is it equivalent to closing and re-opening Python?
Thank you
Neither of them. It just returns to the OS the memory that is not actively used right now.
Restarting Python will clear everything used by PyTorch.
Restarting the OS will restart the GPU completely, hence clearing everything, even non-PyTorch-related state.
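To make the distinction concrete, here is a minimal sketch (assuming a CUDA device is available) that watches the allocator counters around empty_cache(). It illustrates that empty_cache() only hands back cached blocks that are no longer backed by live tensors:

import torch

x = torch.randn(1024, 1024, device="cuda")  # a live tensor holding ~4 MB on the GPU
print(torch.cuda.memory_allocated())        # bytes owned by live tensors
print(torch.cuda.memory_reserved())         # bytes held by PyTorch's caching allocator

torch.cuda.empty_cache()                    # frees nothing here: x still owns its memory
print(torch.cuda.memory_allocated())        # unchanged

del x                                       # tensor gone, but its block stays in the cache
torch.cuda.empty_cache()                    # now the cached block is returned to the driver
print(torch.cuda.memory_reserved())         # drops back towards 0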
Thank you @albanD. It seems like for what I'm doing (testing what my GPUs can handle without overloading them), all I need is to run torch.cuda.empty_cache()
and potentially restart Python, since I'm only using the server for PyTorch right now.
Yes, restarting Python is the right thing to do to make sure everything works fine again after a memory error.
Many thanks for your guidance @albanD!
Hi
Thanks for the great replies. In my case I have trained the model on the GPU. Now I am using the saved model in other code to check the accuracy of my trained network. I tried torch.cuda.empty_cache(),
but it is not helping.
I think this is caused by variables saved on my GPU. I am attaching my code here so you can get a better idea.
import torch

model = torch.load('Two_layer_transpose_CNN.pth')  # load the trained model
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)  # move it to the GPU if one is available
torch.cuda.empty_cache()
I have trained the model to build a high-resolution image from a low-resolution image, and during testing I am getting this error.
output = []
for i_batch, sample_batched in enumerate(Data_Loader):
    #print(i_batch)
    input = sample_batched['small_image'].float().to(device)
    i = model(input).to(device)
    print(sample_batched['small_image'].shape)
    output.append(i)
This is the output with the error.
torch.Size([1, 3, 678, 1020])
torch.Size([1, 3, 678, 1020])
torch.Size([1, 3, 678, 1020])
.
.
.
torch.Size([1, 3, 678, 1020])
torch.Size([1, 3, 696, 1020])
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-14-7986a773d6c9> in <module>
3 #print(i_batch)
4
----> 5 input = sample_batched['small_image'].float().to(device)
6 i = model(input).to(device)
7 print(sample_batched['small_image'].shape)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 2.74 GiB already allocated; 294.40 KiB free; 2.78 GiB reserved in total by PyTorch)
I am stuck here.
Please help if you can. Thanks in advance.
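For anyone hitting this pattern: memory usually climbs batch by batch here because each tensor returned by the model still carries its autograd graph and stays on the GPU once appended to output. A minimal sketch of the common fix, reusing the Data_Loader, model and device names from the snippet above, is to run the loop under torch.no_grad() and move the results to the CPU:

output = []
model.eval()                       # disable dropout / batchnorm updates for inference
with torch.no_grad():              # no autograd graph is kept for these forward passes
    for i_batch, sample_batched in enumerate(Data_Loader):
        input = sample_batched['small_image'].float().to(device)
        out = model(input)
        output.append(out.cpu())   # move the result off the GPU so it does not accumulate there

With the graph disabled and the outputs kept on the CPU, the per-batch GPU usage should stay roughly flat, and empty_cache() is not needed inside the loop.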
I realised that I am having a similar issue to the one above. I am dealing with a problem where the input from each batch has a different length. Currently, I am using torch.cuda.empty_cache() to avoid OOM issues; otherwise I always get an OOM error after 20-30 epochs.
Did you finally figure out how to solve this problem?
def capture_gradients(args, model, save_dir, dataset, masking_function=None, output_hidden_states=False, loss_func=CrossEntropyLoss(), verbose=False) -> torch.Tensor:
    model.eval()
    accumulated_gradient = {}
    dataloader = DataLoader(dataset, batch_size=1, shuffle=False)
    for example in dataloader:
        monitor_all_gpus(args.logger)
        torch.cuda.empty_cache()
        outputs = model(**example, output_hidden_states=output_hidden_states)
        shift_logits = outputs.logits[..., :-1, :].contiguous()  # Get rid of the prediction for the last token, since we don't have a label for it
        shift_labels = example["input_ids"][..., 1:].contiguous()  # Get rid of the label of the first token, since no prediction is made for it
        shift_logits = shift_logits.view(-1, model.config.vocab_size)
        shift_labels = shift_labels.view(-1)
        shift_labels = shift_labels.to(shift_logits.device)
        loss = loss_func(shift_logits, shift_labels)
        loss.backward()
        for name, param in model.named_parameters():  # This grabs the same params as model.parameters(); in fact, model.parameters() calls this function under the hood: https://pytorch.org/docs/stable/_modules/torch/nn/modules/module.html#Module.named_parameters
            accumulated_gradient[name] = param.grad
    torch.cuda.empty_cache()
    monitor_all_gpus(args.logger)
    return accumulated_gradient
I am trying to capture the gradients of an LLM. It is possible to fully fine-tune the LLM on my setup using the Hugging Face Trainer abstraction, so there is no issue with the amount of GPU memory I have.
However, the only way I have been able to make this function work is by adding torch.cuda.empty_cache(). Otherwise, I run out of memory after 3-4 batches. I wonder what could be causing tensors to linger in memory.
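A likely cause, judging only from the code as posted: accumulated_gradient[name] = param.grad stores a reference to the live .grad tensor, the gradients are never zeroed between examples, and outputs / loss (and the graph they hold) stay alive into the next iteration. A minimal sketch of the loop body with those references cut, keeping the same function and variable names as the post (they are not a library API), could look roughly like this:

for example in dataloader:
    model.zero_grad(set_to_none=True)  # drop the previous example's gradients before the next backward pass
    outputs = model(**example, output_hidden_states=output_hidden_states)
    shift_logits = outputs.logits[..., :-1, :].contiguous().view(-1, model.config.vocab_size)
    shift_labels = example["input_ids"][..., 1:].contiguous().view(-1).to(shift_logits.device)
    loss = loss_func(shift_logits, shift_labels)
    loss.backward()
    for name, param in model.named_parameters():
        if param.grad is not None:
            g = param.grad.detach().to("cpu")  # detached CPU copy: keeps no reference to GPU memory or the graph
            if name in accumulated_gradient:
                accumulated_gradient[name] += g
            else:
                accumulated_gradient[name] = g
    del outputs, loss  # let the activations and graph be freed before the next example

With the per-example graph released and the accumulated copies living on the CPU, the torch.cuda.empty_cache() calls should no longer be necessary to stay within memory.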