About torch.cuda.empty_cache()

Ah I see :slight_smile: I'm trying to get model parallelism and data parallelism to work so that I can (hopefully) use multiple GPUs.


Can you please clarify what restarting actually does? Sometimes I fill up CUDA memory with settings that are too much for it, and I want to adjust the settings and try again. Is running torch.cuda.empty_cache() equivalent to rebooting the machine? Or is it equivalent to closing and re-opening Python?
Thank you

Neither of them. It just returns to the OS the memory that is not actively used right now.
Restarting Python will clear everything used by PyTorch.
Restarting the OS will reset the GPU completely, hence clearing everything, even non-PyTorch allocations.
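To make that distinction concrete, here is a minimal sketch (assuming a single CUDA device is available) of how allocated vs. reserved memory behaves around empty_cache():

import torch

x = torch.randn(1024, 1024, device="cuda")   # a live tensor counts as "allocated" memory
print(torch.cuda.memory_allocated())         # bytes held by live tensors
print(torch.cuda.memory_reserved())          # bytes held by PyTorch's caching allocator

del x                                        # the tensor is gone, but its block stays cached
print(torch.cuda.memory_allocated())         # drops back down
print(torch.cuda.memory_reserved())          # still high: the block is kept for reuse

torch.cuda.empty_cache()                     # hand unused cached blocks back
print(torch.cuda.memory_reserved())          # now other applications can use that memory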


Thank you @albanD. It seems like for what I'm doing (testing what my GPUs can handle without overloading them), all I need is to run torch.cuda.empty_cache() and potentially restart Python, since I'm only using the server for PyTorch right now.

Yes, restarting Python is the right thing to do to make sure everything works fine again after a memory error.


Many thanks for your guidance @albanD!

Hi,
thanks for the great replies. In my case I have trained the model on the GPU, and now I am using the saved model in other code to check the accuracy of my trained network. I tried torch.cuda.empty_cache(), but it is not working.
I think this is caused by variables saved on my GPU. I am attaching my code here so you can get a better idea.

import torch

# Load the trained model and move it to the GPU if one is available.
model = torch.load('Two_layer_transpose_CNN.pth')
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model.to(device)

# Release cached blocks that are not backing any live tensors.
torch.cuda.empty_cache()
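One side note on the loading step: by default torch.load restores tensors onto the device they were saved from. If you ever need explicit control over where the weights end up (for example when the saving and loading GPUs differ), map_location is the usual way to do that; a small sketch reusing the path above:

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Map the checkpoint straight onto the evaluation device instead of
# whatever device it was saved from.
model = torch.load('Two_layer_transpose_CNN.pth', map_location=device)
model.eval()  # switch dropout/batch-norm layers to inference behaviour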

Now I have a trained model that builds a high-resolution image from a low-resolution image, and in testing I am getting this error.

output = []
for i_batch, sample_batched in enumerate(Data_Loader):
    #print(i_batch)
    
    input = sample_batched['small_image'].float().to(device)
    i = model(input).to(device)
    print(sample_batched['small_image'].shape)
    output.append(i) 

This is the output with the error:

torch.Size([1, 3, 678, 1020])
torch.Size([1, 3, 678, 1020])
torch.Size([1, 3, 678, 1020])
.
.
.
torch.Size([1, 3, 678, 1020])
torch.Size([1, 3, 696, 1020])
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-14-7986a773d6c9> in <module>
      3     #print(i_batch)
      4 
----> 5     input = sample_batched['small_image'].float().to(device)
      6     i = model(input).to(device)
      7     print(sample_batched['small_image'].shape)

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 2.74 GiB already allocated; 294.40 KiB free; 2.78 GiB reserved in total by PyTorch)

I am stuck here.
Please help if you can. Thanks in advance.
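A likely culprit here is the loop itself rather than the cache: model(input) builds an autograd graph for every batch, and each result appended to output stays on the GPU, so memory grows batch after batch until it runs out. A minimal sketch of the usual inference pattern, assuming the collected outputs are only needed on the CPU afterwards:

import torch

model.eval()                       # put the layers into inference mode
output = []

with torch.no_grad():              # do not build autograd graphs during evaluation
    for i_batch, sample_batched in enumerate(Data_Loader):
        small = sample_batched['small_image'].float().to(device)
        pred = model(small)
        output.append(pred.cpu())  # move results off the GPU so they do not pile up there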

I realised that I am having a similar issue to yours. I am dealing with a problem where the input from each batch has a different length. Currently, I am using torch.cuda.empty_cache() to avoid OOM issues; otherwise I always get an OOM error after 20-30 epochs.

Did you finally figure out how to solve this problem?
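For anyone who lands here with the same pattern: variable-length inputs make the caching allocator keep blocks of many different sizes, but an OOM that only appears after 20-30 epochs usually means some tensor is being kept alive across iterations (for example a loss accumulated without .item()). A hedged sketch of a loop that avoids that; num_epochs, optimizer, criterion and the 'target' key are hypothetical placeholders, not names from this thread:

import torch

for epoch in range(num_epochs):                # num_epochs is a placeholder
    running_loss = 0.0
    for batch in Data_Loader:
        x = batch['small_image'].float().to(device)
        target = batch['target'].to(device)    # placeholder key

        optimizer.zero_grad()                  # optimizer/criterion are placeholders too
        pred = model(x)
        loss = criterion(pred, target)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()            # .item() detaches; accumulating `loss` itself
                                               # would keep every graph (and its GPU memory) alive
        del x, target, pred, loss              # drop references before the next, differently-sized batch

    torch.cuda.empty_cache()                   # optional: hand cached blocks back between epochs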