I am doing hyperparameter tuning with Hyperopt on 2 GPUs. Initially the GPU RAM used is 758 MB, which is below the threshold I have defined, but after one more training run the used RAM increases to 1796 MB. To solve this I tried calling torch.cuda.empty_cache() after each training run, but it does not seem to work. I also tried deleting some of the tensors after training, but there are a lot of such variables.
Does anyone know how I can solve this problem?
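Roughly, the cleanup I am doing after each trial looks like this (simplified sketch; build_model and train_one_trial are placeholders for my actual code):
import gc
import torch
from hyperopt import STATUS_OK

def objective(params):
    model, optimizer = build_model(params)    # placeholder for my model setup
    loss = train_one_trial(model, optimizer)  # placeholder for my training loop
    # attempted cleanup after each Hyperopt trial
    del model, optimizer
    gc.collect()
    torch.cuda.empty_cache()
    return {'loss': loss, 'status': STATUS_OK}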
How are you measuring the memory usage before and after each run? If you are measuring memory usage before all kernels have been executed once, then lazy module loading would increase memory usage as kernels are loaded for the first time, and clearing the cache wouldn't free this memory if the process is kept alive. Similarly, memory is also used for workspaces (e.g., cuBLAS) and 'plans' (e.g., cuFFT) that are kept around because they are likely to be reused. It is generally not advisable to free this memory manually, but it can be done with functions such as
torch._C._cuda_clearCublasWorkspaces()
and
https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.clear
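As a rough sketch (note: the cuBLAS call below is a private API and its availability may change between PyTorch versions), freeing these caches after a trial could look like:
import torch

def release_gpu_caches():
    # return cached allocator blocks to the driver
    torch.cuda.empty_cache()
    # free cuBLAS workspaces (private API)
    torch._C._cuda_clearCublasWorkspaces()
    # clear cached cuFFT plans
    torch.backends.cuda.cufft_plan_cache.clear()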
If you are observing that the memory usage is still increasing with each run after everything has fully 'warmed up,' then that would be an unexpected bug.
Thank you for your reply.
For each run, I am using the following:
torch.cuda.init()
res = {'gpu': torch.cuda.utilization(device)}
# mem_get_info returns (free, total) in bytes
torch_cuda_mem = torch.cuda.mem_get_info(device)
mem = {
    'used': torch_cuda_mem[-1] - torch_cuda_mem[0],  # total - free
    'total': torch_cuda_mem[-1]
}
Right, so if the initial measurement is taken before everything has been run once, I would check that memory usage stabilizes after warmup (i.e., after running all of the computation / a single iteration once).
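For example (a minimal sketch; train_one_trial stands in for your training function and num_trials is arbitrary), measuring only after a warmup run would look like:
import torch

device = 0
num_trials = 5

train_one_trial()  # warmup: kernels, cuBLAS workspaces, cuFFT plans get allocated here
for i in range(num_trials):
    train_one_trial()
    torch.cuda.empty_cache()
    free, total = torch.cuda.mem_get_info(device)
    # used memory should stabilize after warmup; a steady increase would indicate a leak
    print(f'trial {i}: used {(total - free) / 1e6:.0f} MB')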
So you mean that when I run the first training the memory used is 758 MB, then after I empty the cache and run another training the memory increases slightly, and then it should stabilize for the remaining training jobs?
Yes, unless you mean the first measurement was done following the first training run.
Thank you. No, it was done before each training run.