I started working on a tool that will provide this solution.
I think I have the prototype working, but I'm stuck on not being able to emulate memory fragmentation, so I can't test its correctness.
Is the following guaranteed to allocate a contiguous block of GPU RAM?
torch.ones((d, d)).cuda().contiguous()
I added contiguous(), but it doesn't seem to make any difference in this case (and I checked with is_contiguous()).
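For reference, the check I mean was roughly the following (a minimal sketch; 512 is just an example dimension, not one of the sizes used below):

import torch

t = torch.ones((512, 512)).cuda().contiguous()
print(t.is_contiguous())  # True: the tensor's strides describe a row-major contiguous layout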
So here is my attempt to create a hole in memory such that the next allocation request is bigger than the hole, no remaining free block is as large as that request, yet there is enough total free memory to satisfy it. The allocation should then fail due to fragmentation, but it succeeds.
# this ensures we always test the same thing
buf = leave_free_mbs(1600)
# legend: [free block] {used block}
# [1601]
x1 = mem_get(512) # {512}[1089]
x2 = mem_get(512) # {512}{512}[577]
print(f"have {mem_free():4d}, reclaiming first 512")
del x1 # [512]{512}[577]
x3 = mem_get(1024) # shouldn't be able to allocate 1024 contiguous mem
print(f"have {mem_free():4d}")
which outputs:
consuming 4054MB to bring free mem to 1600MBs
have 1601, allocating 512
have 1089, allocating 512
have 577, reclaiming first 512
have 1089, allocating 1024
have 65
So the last call to allocate 1024MB succeeds despite supposedly having only two free chunks, one of 512MB and another of ~576MB, separated by a 512MB used chunk. If my allocation function allocates contiguous memory, how is it then successful?
Am I doing something wrong?
Thank you.
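In case it helps with diagnosing this, here is a sketch of how one could also dump the caching allocator's own view of memory between the steps above. It assumes a PyTorch version that provides torch.cuda.memory_allocated(), torch.cuda.memory_reserved() and torch.cuda.memory_summary(); allocator_report is just a hypothetical helper name, and the mem_free() used above only reports what NVML sees, not how the allocator has split its blocks:

import torch

def allocator_report(tag):
    # how much the caching allocator has handed out to tensors vs. what it holds in its cache
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={allocated:.0f}MB reserved={reserved:.0f}MB")
    # per-pool breakdown of the allocator's segments and blocks
    print(torch.cuda.memory_summary(abbreviated=True))

Calling allocator_report() right after del x1 and again after the 1024MB request should show whether the allocator satisfied the request from cached blocks or had to reserve a new segment.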
Here is the whole program, should you want to run it yourself. Make sure to pip install nvidia-ml-py3 before you run it.
import pynvml, torch, gc
pynvml.nvmlInit()
id = torch.cuda.current_device()
def mem_free():
    gc.collect()
    torch.cuda.empty_cache()
    handle = pynvml.nvmlDeviceGetHandleByIndex(id)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return int(info.free / 2**20)

def mem_report(): print(f"free mem={mem_free()}")

def mem_allocate_mbs(n, fatal=False):
    " allocate n MBs, return the var holding it on success, None on failure "
    if n < 6: return None  # don't try to allocate less than 6MB
    try:
        # a d x d float32 tensor takes d*d*4 bytes; with d = 2**9 * sqrt(n)
        # that is 2**18 * n * 4 = n * 2**20 bytes, i.e. n MBs
        d = int(2**9 * n**0.5)
        return torch.ones((d, d)).cuda().contiguous()
    except Exception as e:
        if not fatal: return None
        raise e

def leave_free_mbs(n):
    " consume whatever memory is needed so that n MBs are left free "
    avail = mem_free()
    assert avail > n, f"already have less available mem than desired {n}MBs"
    consume = avail - n
    print(f"consuming {consume}MB to bring free mem to {n}MBs")
    return mem_allocate_mbs(consume, fatal=True)

def globals_unset(var_names):
    " useful for re-running the cell: resets the initial state, or cleans up at the end of the cell "
    for x in var_names:
        if x in globals():
            del globals()[x]

def mem_get(n):
    print(f"have {mem_free():4d}, allocating {n}")
    return mem_allocate_mbs(n, fatal=True)
globals_unset(['x1', 'x2', 'x3', 'buf'])
_ = torch.ones(1).cuda()  # preload
# this ensures we always test the same thing
buf = leave_free_mbs(1600)
# legend: [free block] {used block}
# [1600]
x1 = mem_get(512) # {512}[1092]
x2 = mem_get(512) # {512}{512}[576]
print(f"have {mem_free():4d}, reclaiming first 512")
del x1 # [512]{512}[576]
x3 = mem_get(1024) # shouldn't be able to allocate 1024 contiguous mem
print(f"have {mem_free():4d}")
# cleanup
globals_unset(['x1', 'x2', 'x3', 'buf'])
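If it is useful to see the last step fail without raising, the same helper can be called with fatal=False, since it returns None when the allocation fails; a small variation on the last step of the script above:

# variation on the last step: treat a failed allocation as data rather than an exception
x3 = mem_allocate_mbs(1024, fatal=False)
if x3 is None:
    print("could not allocate 1024MB, fragmentation reproduced")
else:
    print(f"1024MB allocated anyway, have {mem_free():4d} left")
globals_unset(['x3'])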