Unable to empty CUDA cache

I’m trying to free some GPU memory so that other processes can use it. I tried calling torch.cuda.empty_cache() after deleting the tensor, but for some reason it doesn’t seem to work.

I wrote this small script to reproduce the problem:

import os
import torch
from GPUtil import showUtilization   # prints per-GPU memory/load utilization

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

showUtilization()                    # before allocating anything
t = torch.zeros((1, 2**6, 2**6)).to('cuda')
showUtilization()                    # after allocating the tensor
del t
torch.cuda.empty_cache()
showUtilization()                    # after deleting it and emptying the cache

The memory utilization grows from 5% to 12% after allocating the tensor and stays at 12% even after emptying the cache.
Of course the memory is released when the process terminates, but I need to free it while the process is still running. Does anyone have any idea how to solve this?

Your approach should work as shown here:

print(torch.cuda.memory_allocated())
> 0
print(torch.cuda.memory_reserved())
> 0

t = torch.zeros((1, 2**6, 2**6)).to('cuda')
print(torch.cuda.memory_allocated())
> 16384
print(torch.cuda.memory_reserved())
> 2097152

del t
print(torch.cuda.memory_allocated())
> 0
print(torch.cuda.memory_reserved())
> 2097152

torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())
> 0
print(torch.cuda.memory_reserved())
> 0

Note that the first CUDA operation will create the CUDA context on the device, which loads all kernels, cuDNN, etc. onto it.
This memory is not reported by torch.cuda.memory_allocated() or torch.cuda.memory_reserved(), but it can be seen via nvidia-smi.
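
You can see this yourself with a minimal sketch (assuming a CUDA-capable setup; the exact context size depends on your GPU, driver, and CUDA/cuDNN versions):

import torch

# Touch the GPU once so the context gets created, then free everything again.
t = torch.zeros(1, device='cuda')
del t
torch.cuda.empty_cache()

# PyTorch's allocator now reports nothing in use or cached ...
print(torch.cuda.memory_allocated())   # 0
print(torch.cuda.memory_reserved())    # 0

# ... but nvidia-smi still shows several hundred MiB for this process:
# that remainder is the context memory, which PyTorch cannot release.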

Thanks for your response. If I run your code I get exactly the same results, but for some reason nvidia-smi doesn’t seem to notice that the memory was deallocated.
If I run this code:

import os
import re
import subprocess
import torch

def nvidia_smi():
    # Parse the "used / total" memory column of nvidia-smi for every GPU
    # (16130 MiB is the total memory my cards report).
    out = subprocess.Popen(['nvidia-smi'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    sout, serr = out.communicate()
    res = re.findall("([0-9]*)MiB / 16130MiB", str(sout))
    return [int(x) for x in res]

# 'cuda' now maps to physical GPU 2; nvidia-smi still lists every GPU, hence the [2] below.
os.environ['CUDA_VISIBLE_DEVICES'] = '2'
print(f'{nvidia_smi()[2]} MiB')      # before allocating anything

t = torch.zeros((1, 2**6, 2**6)).to('cuda')
print(f'{nvidia_smi()[2]} MiB')      # after allocating the tensor

del t
print(f'{nvidia_smi()[2]} MiB')      # after deleting it

torch.cuda.empty_cache()
print(f'{nvidia_smi()[2]} MiB')      # after emptying the cache

The result is:

470 MiB
1473 MiB
1473 MiB
1471 MiB

As the process terminates, the used memory goes down to 470 MiB again.
For some reason empty_cache() manages to deallocate 2 MiB (this is consistent and not due to other processes on the same GPU; I’ve tried it multiple times). Thinking about it, I guess those 2 MiB correspond to the tensor I allocated. The other 1001 MiB are probably allocated by the CUDA backend for some internal reason.
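
One way to double-check that guess would be to compare the nvidia-smi step with the allocator’s own counter; a rough sketch on the same tensor as above:

import torch

t = torch.zeros((1, 2**6, 2**6)).to('cuda')
del t

# The allocator still caches the block the tensor lived in:
print(torch.cuda.memory_reserved())    # 2097152 bytes == 2 MiB

# empty_cache() hands exactly that block back to the driver, which would
# explain the 1473 MiB -> 1471 MiB step seen in nvidia-smi above.
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved())    # 0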

I was going to ask if there’s a way to prevent that, but I don’t think there is.

Yes, the 2 MiB are shown in the torch.cuda.memory_reserved() output, which gives you the allocated plus cached memory: 2097152 / 1024**2 = 2.0 MiB.
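
The tensor itself only needs 16 KiB; the caching allocator just hands out memory in larger blocks, so the reserved amount is bigger. A quick check of both numbers (the 2 MiB block size is an allocator implementation detail observed here, not a guarantee):

# Tensor payload: 1 * 64 * 64 float32 elements, 4 bytes each.
print(1 * 2**6 * 2**6 * 4)     # 16384 -> matches torch.cuda.memory_allocated()

# Block the caching allocator reserved for it:
print(2097152 / 1024**2)       # 2.0 MiB -> matches torch.cuda.memory_reserved()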

Yes, those ~1000 MiB are used by the CUDA context as described before and cannot be freed.
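
If you really need that memory back while your main program keeps running, the only option I know of follows from your own observation (usage drops back to 470 MiB once the process exits): run the GPU work in a short-lived child process so the context dies with it. A rough sketch, where gpu_job and the queue are hypothetical stand-ins for your actual workload:

import torch
import torch.multiprocessing as mp

def gpu_job(out_queue):
    # All CUDA state (context included) lives and dies with this child process.
    t = torch.zeros((1, 2**6, 2**6), device='cuda')
    out_queue.put(t.sum().item())     # send results back to the parent on the CPU

if __name__ == '__main__':
    ctx = mp.get_context('spawn')     # 'spawn' is the safe start method for CUDA
    q = ctx.Queue()
    p = ctx.Process(target=gpu_job, args=(q,))
    p.start()
    print(q.get())
    p.join()                          # once the child exits, nvidia-smi drops back down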