Hi, below I have two versions of the same function. In the first, the tensors for "flipped" and "salted" are created directly on the GPU. In the second, they are created on the CPU first and then moved to the GPU. When I run my experiment with the second function, I run out of memory (I use up a lot of memory in other parts of my experiment too). With the first function, I don't. I'm wondering: where is the extra memory coming from? Thanks!
def add_s_and_p_noise(s_and_p, p, gpu):
    q = 0.5
    s_and_p_cloned = s_and_p.clone()
    with torch.cuda.device(gpu):
        flipped = torch.cuda.FloatTensor(s_and_p.shape).uniform_() < p
        salted = torch.cuda.FloatTensor(s_and_p.shape).uniform_() > q
        peppered = ~salted
        s_and_p_cloned[flipped & salted] = 1.0
        s_and_p_cloned[flipped & peppered] = 0.0
    with torch.cuda.device(gpu):
        del flipped
        del salted
        del peppered
        torch.cuda.empty_cache()
    return s_and_p_cloned
def add_s_and_p_noise_two(s_and_p, p, gpu):
    s_and_p_cloned = s_and_p.clone()
    q = 0.5
    flipped = (torch.rand(s_and_p.shape) < p).to(gpu)
    salted = (torch.rand(s_and_p.shape) > q).to(gpu)
    peppered = ~salted
    s_and_p_cloned[flipped & salted] = 1.0
    s_and_p_cloned[flipped & peppered] = 0.0
    with torch.cuda.device(gpu):
        del flipped
        del salted
        del peppered
        torch.cuda.empty_cache()
    return s_and_p_cloned
Does the original input require gradients? If so, the main difference I can see is that all the intermediate results autograd needs to save will be on different devices.
But otherwise, I don't see why one would consume more memory. How do you measure the memory usage of each of these functions?
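One way to answer that measurement question is with PyTorch's `torch.cuda` memory statistics. This is only a sketch: `peak_memory_of` is an illustrative helper (not part of PyTorch), and it assumes a CUDA device is available.

```python
import torch

# Hedged sketch of per-call GPU memory measurement. `peak_memory_of`,
# `fn`, and `device` are illustrative names; assumes a CUDA device.
def peak_memory_of(fn, *args, device=0):
    torch.cuda.synchronize(device)
    torch.cuda.reset_peak_memory_stats(device)   # clear the running peak
    baseline = torch.cuda.memory_allocated(device)
    out = fn(*args)
    torch.cuda.synchronize(device)
    peak = torch.cuda.max_memory_allocated(device)
    return out, peak - baseline  # extra bytes this call needed at its peak
```

For example, `out, extra = peak_memory_of(add_s_and_p_noise, x, 0.3, 0)` would report how many transient bytes the mask tensors cost on top of the tensors that survive the call.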
I would advise against doing:
with torch.cuda.device(gpu):
    del flipped
    del salted
    del peppered
    torch.cuda.empty_cache()
The dels will happen anyway when you exit the function, and the empty_cache() is not going to do anything but slow down your code.
Ok.
Just be careful: you are playing at the limit of memory fragmentation, so any small change in your code can make you OOM for seemingly unrelated reasons.
It would be more reliable to reduce batch size or something similar to make sure you are below the max memory by a safe margin.
True! Thank you. But I'm still curious and lost as to the reason for the difference between the two versions above, and also why the del block did make a difference, given that you thought it wouldn't.
The del don’t make any change but the empty_cache() does something.
Basically it agressively frees up memory (slowing down the process) and changing the memory fragmentation. If you’re lucky, this fragmentation will be less and you won’t OOM.
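To make the effect of empty_cache() concrete, here is a minimal sketch (assuming a CUDA device is available; the body is skipped otherwise). The key distinction is between memory held by live tensors ("allocated") and memory the caching allocator keeps reserved from the driver ("reserved"); empty_cache() only returns the reserved-but-unallocated part.

```python
import torch

# Hedged sketch, assuming a CUDA device: empty_cache() releases cached
# blocks back to the driver, which changes fragmentation but does not
# free memory that live tensors still hold.
if torch.cuda.is_available():
    x = torch.empty(1024, 1024, device="cuda")  # ~4 MB of float32
    del x                                       # "allocated" drops here
    before = torch.cuda.memory_reserved()       # cache still holds the block
    torch.cuda.empty_cache()                    # hand cached blocks back
    after = torch.cuda.memory_reserved()
    assert after <= before
```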
Because the CUDA driver does some smart things to reduce fragmentation on average. But if you end up in a worst case for its heuristics, it can actually be worse than our allocator (this is rare).
Allocation is a fairly hard problem because you don't know about the future, so you cannot make a globally optimal choice for the final state. You have to use heuristics. And since you don't want to reshuffle memory after it has been allocated, you cannot really predict the final fragmentation.
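The fragmentation problem can be made concrete with a toy first-fit allocator (pure Python, purely illustrative; this is not PyTorch's actual allocator): total free memory can be plentiful while no single contiguous block is large enough to serve a request.

```python
# Toy first-fit allocator over a flat arena. Each free block is a
# (start, length) pair; a request fails if no single block fits it.
def first_fit(free_blocks, size):
    """Return the index of the first free block of at least `size`, else None."""
    for i, (start, length) in enumerate(free_blocks):
        if length >= size:
            return i
    return None

# Arena fragmented into four 10-unit holes: 40 units free in total.
free_blocks = [(0, 10), (25, 10), (50, 10), (75, 10)]
assert first_fit(free_blocks, 10) is not None  # a small request fits
assert first_fit(free_blocks, 30) is None      # 30 < 40 free, yet "OOM"
```

This is exactly the situation where freeing and re-requesting memory in a different order (as empty_cache() effectively causes) can change whether a given allocation succeeds.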