Recently, I used the function torch.cuda.empty_cache() to release unused memory after processing each batch, and it indeed works (it saves at least 50% of the memory compared to the code without this call).
At the same time, the time cost does not increase too much and the current results (i.e., the evaluation scores on the testing dataset) are more or less OK.
Since I only compared runs of my own program, it is too limited to draw a general conclusion from. I wonder: will this function affect the training of the model?
This function should not be used by the end-user except in very edge cases.
PyTorch does not release memory back to the OS when you delete Tensors on the GPU; it keeps it in a pool so that subsequent allocations can be done much faster. As you saw, without this pool, GPU code is much slower.
You don’t need to call this function explicitly: even though the memory does not show as free in nvidia-smi, it is still available to create more Tensors, so it will not cause out-of-memory problems.
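You can see this pooling behaviour directly with the built-in memory stats (a minimal sketch; the assertions assume nothing else is allocating on the GPU, and the whole block is a no-op on a CPU-only machine):

```python
import torch

if torch.cuda.is_available():  # these stats only exist for CUDA devices
    t = torch.empty(1024, 1024, device="cuda")  # ~4 MB of float32
    allocated = torch.cuda.memory_allocated()   # memory backing live tensors
    reserved = torch.cuda.memory_reserved()     # memory held in the pool (what nvidia-smi shows)
    del t
    # The tensor is gone from "allocated", but its block stays in the pool:
    assert torch.cuda.memory_allocated() < allocated
    assert torch.cuda.memory_reserved() == reserved
    torch.cuda.empty_cache()  # only now are pooled blocks returned to the driver
```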
@albanD Thanks for your reply. In my case, if I set a large batch size, the program encounters an Out Of Memory error (after several batches), and the error can be avoided by adding this function. Meanwhile, it only increases the time cost by about 100 seconds per epoch (~7000 batches). I think this function is useful, so why don’t you recommend using it?
When you get an OOM error, this function is actually called internally to try and save you. If adding it by hand changes the behaviour, that means that, by chance, you can free a bit more memory (because of partial block use). But this is luck, and it is not a “healthy” state for your program: it will be much slower, and any small code change could alter the block usage and make it OOM again (something as small as swapping the creation order of two Tensors).
One case where you would want to use it, though, is with cudnn benchmark mode (torch.backends.cudnn.benchmark = True). Then you can add one call after the very first forward pass of the program, because benchmark mode can allocate large memory blocks during that first forward to test algorithms, and keeping those blocks around won’t be good later.
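A sketch of that pattern (the Conv2d model and the input are just placeholders):

```python
import torch

torch.backends.cudnn.benchmark = True  # let cudnn benchmark algorithms on the first forward

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv2d(3, 16, kernel_size=3).to(device)  # placeholder model
x = torch.randn(8, 3, 32, 32, device=device)              # placeholder input

out = model(x)  # very first forward: benchmarking can allocate large scratch blocks

if torch.cuda.is_available():
    torch.cuda.empty_cache()  # release the benchmarking scratch memory, once

out = model(x)  # later forwards reuse the chosen algorithm without those big allocations
```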
I also use torch.cuda.empty_cache, but I found it takes about 3-5 s per call. Does anyone know how to release the memory more efficiently?
This is expected to be slow, though not several seconds (make sure to use torch.cuda.synchronize() properly when timing code).
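For reference, a minimal timing sketch that accounts for CUDA’s asynchronous execution (the helper name is mine, not a PyTorch API):

```python
import time
import torch

def timed(fn, *args):
    """Time fn correctly: CUDA kernels run asynchronously, so we must
    synchronize before starting and before stopping the clock."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish pending work before starting the clock
    start = time.perf_counter()
    result = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the op itself to complete
    return result, time.perf_counter() - start

# e.g. on a GPU machine: _, secs = timed(torch.cuda.empty_cache)
```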
You should not have to use this function for normal use. Why do you use it?
It is related to this question: Pytorch abnormal inference time
I am getting an OOM:
RuntimeError: CUDA out of memory. Tried to allocate 2.58 GiB (GPU 0; 15.78 GiB total capacity; 8.19 GiB already allocated; 1.67 GiB free; 4.96 GiB cached; 0 bytes inactive) (malloc at ... cuda/CUDACachingAllocator.cpp:382)
There are 4.96 GiB cached along with 1.67 GiB free, yet torch couldn’t allocate the 2.58 GiB. I suspect that malloc was trying to find a 2.58 GiB contiguous memory block but couldn’t?
My question is: if I force torch.cuda.empty_cache() once in a while (maybe once per epoch, or every N steps?), will it improve the contiguous-block situation between the free and cached pools?
No, it most likely won’t.
The trick we use to get a more contiguous block is to free the memory back to the GPU driver and then allocate it again. But that is already done when you’re about to run out of memory.
And doing this repeatedly will slow down the process for not much gain.
But this is quite a bad case of fragmentation. Do you have a small code sample that reproduces it? We might be able to improve the allocator for this bad case.
Thank you for the instant reply!
My code is not small enough to be easily shared: it is a transformer NMT model. (In the case above, the model was wrapped in DataParallel on 4 GPUs.)
I am worried the fragmentation issue is not deterministically reproducible from my code alone, because it mainly depends on the dataset.
I would need to package a small dataset and manually seed the RNGs to make it reproducible. On smaller datasets it is not as much of an issue, as there is less variance (described below). I will follow up when I have everything needed to reproduce it on your machine.
Let me describe my situation:
The reason for such bad fragmentation, I think, is that the training data is text with unequal sentence lengths. Each batch can have its own sequence length with padding, and we shuffle the batches, so there can be a large variance in memory requirements between any two consecutive batches.
We improved the situation with batch_size = B x L, where L = #tokens in a sentence including padding. B x L has an upper limit (the batch size); both B and L can vary individually between any two consecutive batches, e.g. for B x L = 4096 and for the model of dim …
In an extreme case, we are seeing tensors of shape …
But in the average case, due to the variance in lengths, B x L is NOT guaranteed to be exactly 4096; it can be a little lower sometimes, say 10 x 409 = 4090, for example. Hence there are still small variations (spikes and surges) in memory usage across batches. I suspect these variations lead to fragmentation in the long run, eventually causing the CUDA OOM.
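For reference, the B x L budgeting described above can be sketched in plain Python (a toy version; the budget value and the longest-first grouping are illustrative choices, not our exact implementation):

```python
def make_batches(lengths, budget=4096):
    """Group sentence lengths into batches where B * max_len <= budget.

    B is the number of sentences in the batch, and max_len (L) is the
    longest sentence, i.e. the padded length of every row in the batch.
    """
    batches, current = [], []
    for n in sorted(lengths, reverse=True):  # longest first limits padding waste
        max_len = max([n] + current)
        if current and (len(current) + 1) * max_len > budget:
            batches.append(current)  # adding this sentence would exceed the budget
            current = []
        current.append(n)
    if current:
        batches.append(current)
    return batches

# Every batch respects the B * L budget, but the actual product varies
# from batch to batch, which is exactly the memory variation described above.
batches = make_batches([500, 480, 200, 200, 30, 20, 10], budget=4096)
assert all(len(b) * max(b) <= 4096 for b in batches)
```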
With this new information, is there any new advice for me to improve the fragmentation situation?
cc @colesbury that might have a better idea how to do this?
In my case, I perform a lot of computation outside of the graph (under a torch.no_grad() context), where many large tensors are computed or randomly generated at every forward pass.
I just inserted four torch.cuda.empty_cache() calls throughout my ops at every forward pass, which resulted in a ~20% slowdown, but I am able to increase my batch size from 9 to 14, which is a good trade-off for me. So far I haven’t run into any issues, and the model (a custom VGG-like network) trains fine.
I empty the cache immediately after I delete a tensor with del.
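The pattern looks roughly like this (the function body and tensor names are placeholders for my actual ops):

```python
import torch

def extra_computation(x):
    """Placeholder for the out-of-graph work described above."""
    with torch.no_grad():                 # computation outside the autograd graph
        noise = torch.randn_like(x)       # large tensor generated every forward
        scratch = (x + noise).pow(2)      # large intermediate
        result = scratch.mean(dim=-1)
        del noise, scratch                # drop the big tensors first...
        if torch.cuda.is_available():
            torch.cuda.empty_cache()      # ...then hand their blocks back to the driver
    return result
```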
Could you suggest any additional measures or alternatives to save memory?
I avoided a CUDA out-of-memory issue by inserting
torch.cuda.empty_cache() at the end of each batch. However, I have one question regarding its behaviour (my question may seem a bit naive, since I am new to PyTorch): if, at the end of each batch, I delete the model instance from the GPU instead of the previous batch’s tensors, how will that affect my model’s training? Does the model object hold onto some important information regarding the model’s state or weights?
The model does contain all the weights and necessary “forward” functions to evaluate your model. So if you delete it, you won’t be able to evaluate it any more.
I am trying to sort out some memory issues, so evaluation is not a problem currently. I wanted to ask: if I delete the model instance after each iteration, does it affect the next iteration? I mean, does PyTorch store the model’s state somewhere else, so that even after deleting the model instance there is a mechanism to retrieve it?
Can you give pseudo code of what you want to do? I’m not sure to understand.
Do you mean this?
model = Model()
for sample in data:
    out = model(sample)
    # more stuff
    del model
    # Then the second iteration will fail because model does not exist anymore!
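To be clear, there is no hidden copy of the state anywhere: if you really wanted to recreate the model each iteration, you would have to save and restore the state_dict yourself. A sketch, with Model as a placeholder module:

```python
import torch

class Model(torch.nn.Module):
    """Placeholder model for illustration."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(x)

model = Model()
# Keep an explicit copy of the weights before deleting the model:
state = {k: v.clone() for k, v in model.state_dict().items()}
del model  # the weights go away with the object...

model = Model()               # fresh instance with new random weights
model.load_state_dict(state)  # ...unless you restore them yourself
```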
@albanD Hello, I ran into a CUDA out of memory issue.
RuntimeError: CUDA out of memory. Tried to allocate 960.00 MiB (GPU 0; 15.78 GiB total capacity; 13.96 GiB already allocated; 747.44 MiB free; 13.98 GiB reserved in total by PyTorch)
13.98 GiB is reserved by PyTorch, so the model is unable to run. I have read that restarting the session frees up the memory. What is the accepted way to “restart the session”? I am running my model in a Docker container with pre-assigned resources.
The “restart session” advice most likely refers to people who use notebooks such as Colab, where the memory is only released if you restart the kernel.
I see. Any solution for folks who are running in a Linux terminal? Even though you don’t recommend it, I tried using torch.cuda.empty_cache() just before training; however, it didn’t work. I am still getting the same error.
Well, if you run in a terminal, the memory is already freed when you start the program. So you already get that behaviour for free.
You might want to check that you don’t have other programs running that use up GPU memory though.
But otherwise, there is not much gain to be had here. You will most likely have to reduce the network size or the batch size if the model does not fit in memory.