I am doing hyperparameter tuning with Hyperopt on 2 GPUs. Initially the GPU RAM used is 758 MB, which is below the threshold I have defined, but after one more training run the used RAM increases to 1796 MB. To solve this I tried calling torch.cuda.empty_cache() after each training run, but it does not seem to work. I also tried deleting some of the tensors after training, but there are a lot of such variables.
Does anyone know how I can solve this problem?
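Roughly, the cleanup I am doing after each trial looks like this (simplified sketch; build_model and train_one_trial are placeholders for my actual code):
import gc
import torch
from hyperopt import STATUS_OK

def objective(params):
    model, optimizer = build_model(params)    # placeholder for my model setup
    loss = train_one_trial(model, optimizer)  # placeholder for my training loop
    # attempted cleanup after each Hyperopt trial
    del model, optimizer
    gc.collect()
    torch.cuda.empty_cache()
    return {'loss': loss, 'status': STATUS_OK}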
How are you measuring the memory usage before and after each run? If you are measuring memory usage before all kernels have been executed once, then lazy module loading would increase memory usage as kernels are loaded for the first time, and clearing the cache wouldn't free this memory if the process is kept alive. Similarly, memory is also used for workspaces (e.g., cuBLAS) and 'plans' (e.g., cuFFT) that are kept around because they are likely to be reused. It is generally not advisable to free this memory manually, but it can be done with functions such as
torch._C._cuda_clearCublasWorkspaces()
and
https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.clear
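As a rough sketch (note: the cuBLAS call below is a private API and its availability may change between PyTorch versions), freeing these caches after a trial could look like:
import torch

def release_gpu_caches():
    # return cached allocator blocks to the driver
    torch.cuda.empty_cache()
    # free cuBLAS workspaces (private API)
    torch._C._cuda_clearCublasWorkspaces()
    # clear cached cuFFT plans
    torch.backends.cuda.cufft_plan_cache.clear()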
If you are observing that the memory usage is still increasing with each run after everything has fully 'warmed up,' then that would be an unexpected bug.
Thank you for your reply.
For each run, I am using the following:
torch.cuda.init()
res = {'gpu': torch.cuda.utilization(device)}
# mem_get_info returns (free, total) in bytes
torch_cuda_mem = torch.cuda.mem_get_info(device)
mem = {
    'used': torch_cuda_mem[-1] - torch_cuda_mem[0],  # total - free
    'total': torch_cuda_mem[-1]
}
Right, so if the initial measurement is taken before everything has been run once, I would check that memory usage stabilizes after warmup (i.e., after running all of the computation / a single iteration once).
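For example (a minimal sketch; train_one_trial stands in for your training function and num_trials is arbitrary), measuring only after a warmup run would look like:
import torch

device = 0
num_trials = 5

train_one_trial()  # warmup: kernels, cuBLAS workspaces, cuFFT plans get allocated here
for i in range(num_trials):
    train_one_trial()
    torch.cuda.empty_cache()
    free, total = torch.cuda.mem_get_info(device)
    # used memory should stabilize after warmup; a steady increase would indicate a leak
    print(f'trial {i}: used {(total - free) / 1e6:.0f} MB')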
So you mean that when I run the first training the memory used is 758 MB, then after I empty the cache and run another training the memory increases slightly, and then it should stabilize for the remaining training jobs?
Yes, unless you mean the first measurement was done following the first training run.
Thank you. No, it was done before each training run.