This is because the CUDA backend uses a caching allocator. This means that the memory is freed but kept by the allocator rather than returned to the device, so nvidia-smi still reports it as in use.
If, after running del test, you allocate more memory with test2 = torch.Tensor(1000, 1000), you will see that the memory usage stays exactly the same: PyTorch did not re-allocate memory but re-used the block that was freed when you ran del test.
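A quick way to see this is to compare torch.cuda.memory_allocated() with torch.cuda.memory_reserved(). A minimal sketch (I use torch.zeros(..., device="cuda") so the tensor actually lives on the GPU; the exact byte counts will vary on your machine):

import torch

test = torch.zeros(1000, 1000, device="cuda")   # ~4 MB of GPU memory
print(torch.cuda.memory_allocated())            # bytes in use by live tensors
print(torch.cuda.memory_reserved())             # bytes held by the caching allocator

del test
print(torch.cuda.memory_allocated())            # drops: the tensor is gone
print(torch.cuda.memory_reserved())             # unchanged: the block stays in the cache

test2 = torch.zeros(1000, 1000, device="cuda")  # re-uses the cached block
print(torch.cuda.memory_reserved())             # still unchanged: no new allocation from the device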
Hi @albanD,
I am using different models for inference, but my GPU memory is only 10 GB.
How do I unload a model from cuda and switch/load another model to cuda?
Doing model.cpu() will move it back to the CPU.
Assuming that nothing else references the weights, they will be freed and returned to our allocator to be used for other Tensors.
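A minimal sketch of the swap, with two small nn.Sequential stand-ins in place of your real models (model_a and model_b are just placeholder names; any nn.Module behaves the same way):

import torch
import torch.nn as nn

model_a = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
model_b = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
x = torch.randn(8, 4096)

with torch.no_grad():
    model_a.cuda()                 # load the first model onto the GPU
    out_a = model_a(x.cuda())      # run inference

    model_a.cpu()                  # move its weights back to host memory
    torch.cuda.empty_cache()       # optional: return cached blocks to the device
                                   # so nvidia-smi also reflects the freed memory

    model_b.cuda()                 # the second model now fits in the freed space
    out_b = model_b(x.cuda())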
Here’s a minimal example showing that memory is not freed. I want to completely free the tensor memory on the GPU, and to be able to see that reflected in nvidia-smi:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "6"
import torch
from transformers import LlamaForCausalLM
model = LlamaForCausalLM.from_pretrained(
    pretrained_model_name_or_path='decapoda-research/llama-7b-hf',
    load_in_8bit=True,
    device_map={'': 0},
)
del model
torch.cuda.empty_cache()
print('breakpoint here - is memory freed?')
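In place of the breakpoint, the allocator statistics can also be printed directly. A small sketch appended to the script above (gc.collect() is just an extra attempt to drop any lingering references before emptying the cache):

import gc

gc.collect()                                    # drop lingering Python references
torch.cuda.empty_cache()                        # release cached blocks back to the device

print('allocated:', torch.cuda.memory_allocated())  # bytes held by live tensors
print('reserved: ', torch.cuda.memory_reserved())   # bytes kept by the caching allocator;
                                                    # nvidia-smi roughly tracks this plus the CUDA context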