Memory difference depending on whether the tensor was created on the GPU or pushed to the GPU? Strange

Hi, below I have two versions of the same function. In the first, the tensors for “flipped” and “salted” are created directly on the GPU. In the second, they are created on the CPU first and then pushed to the GPU. When I run my experiment with the second function, I run out of memory (I use up a lot of memory in other parts of my experiment too). When I use the first function, I don’t. I’m wondering: where is the extra memory coming from? Thanks!

import torch

def add_s_and_p_noise(s_and_p, p, gpu):
    q = 0.5
    s_and_p_cloned = s_and_p.clone()

    with torch.cuda.device(gpu):
        flipped = torch.cuda.FloatTensor(s_and_p.shape).uniform_() < p
        salted = torch.cuda.FloatTensor(s_and_p.shape).uniform_() > q
    peppered = ~salted
    s_and_p_cloned[flipped & salted] = 1.0
    s_and_p_cloned[flipped & peppered] = 0.0
    with torch.cuda.device(gpu):
        del flipped
        del salted
        del peppered
        torch.cuda.empty_cache()

    return s_and_p_cloned

def add_s_and_p_noise_two(s_and_p, p, gpu):

    s_and_p_cloned = s_and_p.clone()
    q = 0.5
    flipped = (torch.rand(s_and_p.shape) < p).to(gpu)
    salted = (torch.rand(s_and_p.shape) > q).to(gpu)
    peppered = ~salted
    s_and_p_cloned[flipped & salted] = 1.0
    s_and_p_cloned[flipped & peppered] = 0.0
    with torch.cuda.device(gpu):
        del flipped
        del salted
        del peppered
        torch.cuda.empty_cache()

    return s_and_p_cloned

Hi,

Does the original input require gradients? If so, the main difference I can see is that all the intermediate results that autograd needs to save will be on different devices.

But otherwise, I don’t see why one would consume more memory than the other. How do you measure the memory usage of each of these functions?
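For reference, one way to compare them is to look at the peak allocated memory around each call. This is only a minimal sketch, assuming both functions run on the same device; the helper name report_peak_memory and the example shapes are made up for illustration:

    import torch

    def report_peak_memory(fn, *args, device="cuda:0"):
        # Reset the peak-memory counter, run the function, and report the
        # highest amount of memory allocated during the call.
        torch.cuda.reset_peak_memory_stats(device)
        out = fn(*args)
        torch.cuda.synchronize(device)
        peak = torch.cuda.max_memory_allocated(device)
        print(f"{fn.__name__}: peak allocated {peak / 1024**2:.1f} MiB")
        return out

    # Hypothetical usage with the functions from the question:
    # x = torch.rand(64, 3, 256, 256, device="cuda:0")
    # report_peak_memory(add_s_and_p_noise, x, 0.05, 0)
    # report_peak_memory(add_s_and_p_noise_two, x, 0.05, 0)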

I would advise against doing:

    with torch.cuda.device(gpu):
        del flipped
        del salted
        del peppered
        torch.cuda.empty_cache()

The del will happen anyway when you exit the function, and the empty_cache() is not going to do anything except slow down your code.
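For reference, the second function with that block removed would look roughly like this (a sketch of the same logic, not your exact code; the name is made up):

    def add_s_and_p_noise_two_simplified(s_and_p, p, gpu):
        # Same logic as add_s_and_p_noise_two; the masks are freed automatically
        # when the function returns and Python drops the last references.
        q = 0.5
        s_and_p_cloned = s_and_p.clone()
        flipped = (torch.rand(s_and_p.shape) < p).to(gpu)
        salted = (torch.rand(s_and_p.shape) > q).to(gpu)
        peppered = ~salted
        s_and_p_cloned[flipped & salted] = 1.0
        s_and_p_cloned[flipped & peppered] = 0.0
        return s_and_p_cloned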

Without this, I end up with an out-of-memory error though…

Ok.
Just be careful: you are playing at the limit of memory fragmentation, so any small change in your code can make you OOM for seemingly unrelated reasons.
It would be more reliable to reduce the batch size (or something similar) to make sure you stay below the maximum memory by a safe margin.


True! Thank you. But I’m still curious and lost as to the reason behind the memory difference between the two versions above, and also why the del block did make a difference, given that you thought it wouldn’t.

The dels don’t make any difference, but the empty_cache() does do something.
Basically it aggressively frees cached memory back to the driver (slowing down the process) and changes the memory fragmentation. If you’re lucky, the fragmentation will be lower and you won’t OOM.
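If you want to see what it does, you can compare the allocated and reserved counters around a call; a small illustrative sketch (the sizes and device are arbitrary):

    import torch

    device = "cuda:0"
    x = torch.rand(1024, 1024, device=device)      # ~4 MiB of float32
    del x                                          # the block stays cached by the allocator

    print(torch.cuda.memory_allocated(device))     # bytes held by live tensors
    print(torch.cuda.memory_reserved(device))      # bytes held by the caching allocator

    torch.cuda.empty_cache()                       # hand unused cached blocks back to the driver
    print(torch.cuda.memory_reserved(device))      # reserved memory drops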

Why “if you’re lucky”? Shouldn’t it objectively be better if you deleted some things?

Because the CUDA driver does some smart things to reduce fragmentation on average. But if you end up in a worst case for its heuristics, it can actually be worse than our allocator (it is rare :smiley: ).
Allocation is a fairly hard problem: you don’t know the future, so you cannot make a globally optimal choice for the final state and have to rely on heuristics. And since you don’t want to reshuffle memory after it has been allocated, you cannot really predict the final fragmentation.
