CUDA memory not released when copying a tensor into shared memory

It seems that CUDA memory is not released when a tensor is copied into a shared-memory tensor as a whole, presumably because a reference to it is still kept somewhere. However, if I copy only the tensor's data, the CUDA memory is released as soon as the tensor is deleted.

Please find sample code to reproduce the issue below [1]. As written, everything works as expected, but if you pass shm_data=False when calling the run function, you can reproduce the CUDA memory issue.

I got the following output when shm_data=True:

[P0] [GPU Mem (MB)] total: 16130.5, reserved = 102.0, allocated: 60.015625, free: 41.984375, step: 25
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 122.0, allocated: 52.015625, free: 69.984375, step: 50
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 122.0, allocated: 36.015625, free: 85.984375, step: 75
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 122.0, allocated: 44.015625, free: 77.984375, step: 100
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 124.015625, free: 17.984375, step: 125
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 36.015625, free: 105.984375, step: 150
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 60.015625, free: 81.984375, step: 175
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 28.015625, free: 113.984375, step: 200
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 108.015625, free: 33.984375, step: 225
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 52.015625, free: 89.984375, step: 250
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 124.015625, free: 17.984375, step: 275
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 28.015625, free: 113.984375, step: 300

And the following when shm_data=False:

[P0] [GPU Mem (MB)] total: 16130.5, reserved = 162.0, allocated: 108.015625, free: 53.984375, step: 25
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 202.0, allocated: 176.015625, free: 25.984375, step: 50
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 262.0, allocated: 260.015625, free: 1.984375, step: 75
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 322.0, allocated: 308.015625, free: 13.984375, step: 100
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 422.0, allocated: 408.015625, free: 13.984375, step: 125
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 482.0, allocated: 460.015625, free: 21.984375, step: 150
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 522.0, allocated: 504.015625, free: 17.984375, step: 175
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 582.0, allocated: 548.015625, free: 33.984375, step: 200
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 642.0, allocated: 640.015625, free: 1.984375, step: 225
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 722.0, allocated: 700.015625, free: 21.984375, step: 250
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 782.0, allocated: 764.015625, free: 17.984375, step: 275
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 862.0, allocated: 828.015625, free: 33.984375, step: 300

Notice the huge difference in the amount of allocated memory (more than 10x after 300 steps on average).
Even though I have managed to resolve the issue on my end (by copying the tensor data), I still can't figure out why copying the full tensor instead of just its data causes the problem.

Any thoughts?

PyTorch version: 1.9.0

[1] Sample code to reproduce the issue

import torch
import random

def log_gpu_mem(local_rank=0, step=0, unit="MB"):
    # Report CUDA memory stats for the given device, in the requested unit
    u2d = {"KB": 1024, "MB": 1024 ** 2}
    d = u2d[unit]

    total = torch.cuda.get_device_properties(local_rank).total_memory / d
    reserved = torch.cuda.memory_reserved(local_rank) / d
    allocated = torch.cuda.memory_allocated(local_rank) / d
    free = reserved - allocated

    s = f"[GPU Mem ({unit})] total: {total}, reserved = {reserved}, allocated: {allocated}, free: {free}"
    print(f"[P{local_rank}] " + s + f", step: {step}")


def run(steps, batch_sz, layer_sz, shm_data=True):
    model = torch.nn.Linear(layer_sz, layer_sz).to(0)
    optim = torch.optim.Adam(model.parameters())

    track = []  # (step, output) pairs kept alive for a delayed backward pass

    # CPU tensor placed in shared memory; model outputs are copied into it
    shm = torch.zeros((batch_sz, layer_sz))
    shm.share_memory_()

    for step in range(1, steps + 1):
        inp = torch.rand((batch_sz, layer_sz)).to(0)
        out = model(inp)

        track.append((step, out))

        # The only difference between the two modes:
        if shm_data:
            shm.copy_(out.data)  # copy the raw data only (detached from autograd)
        else:
            shm.copy_(out)       # copy the full tensor (the copy is recorded by autograd)

        # Occasionally flush old outputs: drop entries older than a random
        # cutoff and backprop the one sitting at the cutoff
        if random.randint(1, 3) == 1:
            chosen = random.randint(1, 3)
            while True:
                s, out = track[0]
                if s < step - chosen:
                    del out
                    del track[0]
                    continue

                if s > step - chosen:
                    break

                grad = torch.rand(batch_sz, layer_sz).to(0)

                optim.zero_grad()
                out.backward(grad)
                optim.step()

                del out
                del track[0]

        if step % 25 == 0:
            log_gpu_mem(step=step)


if __name__ == "__main__":
    run(300, 1024, 1024, shm_data=True)

Have you tried torch.cuda.empty_cache()? (torch.cuda.empty_cache — PyTorch 1.9.1 documentation)

PyTorch caches GPU memory by default.

Yes, I have tried that as well. In the sample code, you can also just add "torch.cuda.empty_cache()" after the two "del track[0]" lines if you want to test it.

As I currently understand it, if the GPU memory were merely cached, the allocated memory would not keep increasing, since new tensors can reuse the unused cached blocks. However, in the sample output above, you can see that the allocated GPU memory keeps growing (and if you run it for a larger number of steps, you can even hit a CUDA out-of-memory error).
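
To illustrate the distinction, here is a minimal standalone sketch (separate from the repro; the tensor names are just for illustration): empty_cache() can only shrink the cached ("reserved") pool, while "allocated" counts storages that are still referenced and is unaffected by it.

import torch

x = torch.rand(1024, 1024, device=0)   # ~4 MB tensor that stays referenced
y = torch.rand(1024, 1024, device=0)   # ~4 MB tensor that will be dropped
print(torch.cuda.memory_allocated(0), torch.cuda.memory_reserved(0))

del y                       # y's block is returned to the caching allocator
torch.cuda.empty_cache()    # completely unoccupied cached segments go back to the driver

# "allocated" still includes x because it is referenced; empty_cache()
# can never free memory that live tensors are still using.
print(torch.cuda.memory_allocated(0), torch.cuda.memory_reserved(0))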

You are right, sorry, I stand corrected.

copy_ supports backprop, so it records "out" in the autograd graph. Using out.data avoids this, as all autograd functionality is skipped.
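
A minimal sketch of that behavior (the tensor names here are just illustrative):

import torch

src = torch.rand(4, requires_grad=True)
dst = torch.zeros(4)

dst.copy_(src)          # copy_ is differentiable, so the copy is recorded and
                        # dst reaches src's autograd node through its grad_fn
print(dst.grad_fn)      # something like <CopyBackwards object at 0x...>

dst2 = torch.zeros(4)
dst2.copy_(src.data)    # .data detaches, so nothing is recorded
print(dst2.grad_fn)     # None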

But if the shared memory is overwritten (by another copy_ with a different tensor, in this case), wouldn't the reference/memory be released?

Apparently not. Your shm.grad_fn probably says "CopyBackward", and copy_ is just a memory rewriter (partial rewrites are possible); it does not erase the tensor's update history.

But what's the point of keeping "CopyBackward" in shm.grad_fn if I already did an in-place copy operation? Even if I perform a backprop through the shm tensor, it shouldn't affect the old tensor at all, right? So I'd think the old "CopyBackward" is useless and should be collected by the gc.

Wouldn't the version field inside the shm tensor be sufficient to detect and raise an error if I try to backprop against something produced by an outdated tensor (e.g. one altered by an in-place op)?
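
(For reference, this is the version-counter check I have in mind, as a standalone sketch unrelated to shm:)

import torch

a = torch.rand(4, requires_grad=True)
b = a.sigmoid()     # sigmoid saves its output for the backward pass
loss = b.sum()

b.add_(1)           # in-place write bumps b's version counter
loss.backward()     # RuntimeError: one of the variables needed for gradient
                    # computation has been modified by an inplace operation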

I suspect the problem is that copy_ must also work on slices and with multiple views, so it is a simplistic operation. Instead, you're expected to copy inside a no_grad() context (same as using .data).

W.r.t. CopyBackward, I think these are chained, not replaced (again, to support writes into slices).
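
For completeness, a sketch of the no_grad() copy mentioned above, standalone rather than patched into the repro (it ends up behaving like the shm_data=True branch):

import torch

out = torch.rand(4, requires_grad=True) * 2   # stand-in for a model output
shm = torch.zeros(4)
shm.share_memory_()

with torch.no_grad():
    shm.copy_(out)       # the copy is not recorded by autograd

print(shm.grad_fn)       # None: shm keeps no reference back to out's graph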