It seems that CUDA memory is not released when a tensor is copied as a whole into a tensor in shared memory, potentially because a reference to it is kept somewhere. However, if I copy only the tensor data, the CUDA memory is released once the tensor is deleted.
Please find sample code to reproduce the issue below [1]. In the current version everything works as expected, but if you pass shm_data=False when calling the run function, you will be able to reproduce the CUDA memory issue.
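To isolate the difference between the two modes, here is a minimal sketch (not part of the repro below; it runs on CPU, and the names are illustrative):

import torch

out = torch.rand(4, requires_grad=True) * 2  # stands in for a model output
buf = torch.zeros(4)
buf.share_memory_()
buf.copy_(out)  # full-tensor copy, as with shm_data=False
print(buf.grad_fn)  # not None: autograd recorded the in-place copy

buf2 = torch.zeros(4)
buf2.share_memory_()
buf2.copy_(out.data)  # data-only copy, as with shm_data=True
print(buf2.grad_fn)  # None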
I got the following output when shm_data=True:
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 102.0, allocated: 60.015625, free: 41.984375, step: 25
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 122.0, allocated: 52.015625, free: 69.984375, step: 50
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 122.0, allocated: 36.015625, free: 85.984375, step: 75
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 122.0, allocated: 44.015625, free: 77.984375, step: 100
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 124.015625, free: 17.984375, step: 125
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 36.015625, free: 105.984375, step: 150
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 60.015625, free: 81.984375, step: 175
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 28.015625, free: 113.984375, step: 200
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 108.015625, free: 33.984375, step: 225
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 52.015625, free: 89.984375, step: 250
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 124.015625, free: 17.984375, step: 275
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 142.0, allocated: 28.015625, free: 113.984375, step: 300
And the following when shm_data=False:
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 162.0, allocated: 108.015625, free: 53.984375, step: 25
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 202.0, allocated: 176.015625, free: 25.984375, step: 50
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 262.0, allocated: 260.015625, free: 1.984375, step: 75
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 322.0, allocated: 308.015625, free: 13.984375, step: 100
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 422.0, allocated: 408.015625, free: 13.984375, step: 125
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 482.0, allocated: 460.015625, free: 21.984375, step: 150
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 522.0, allocated: 504.015625, free: 17.984375, step: 175
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 582.0, allocated: 548.015625, free: 33.984375, step: 200
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 642.0, allocated: 640.015625, free: 1.984375, step: 225
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 722.0, allocated: 700.015625, free: 21.984375, step: 250
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 782.0, allocated: 764.015625, free: 17.984375, step: 275
[P0] [GPU Mem (MB)] total: 16130.5, reserved = 862.0, allocated: 828.015625, free: 33.984375, step: 300
Notice the huge difference in the amount of allocated memory (more than 10x on average after 300 steps).
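For scale (my own back-of-the-envelope, not in the original logs): with batch_sz = layer_sz = 1024 and float32 elements, each output tensor is 4 MiB, so holding on to even a couple of tensors (or their autograd graphs) per step adds up to growth of this magnitude over 300 steps:

batch_sz = layer_sz = 1024
mib = batch_sz * layer_sz * 4 / 1024 ** 2  # 4 bytes per float32 element
print(mib)  # 4.0 -> each (1024, 1024) activation is 4 MiB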
Even though I have managed to resolve the issue on my end (by copying only the tensor data), I still can't figure out why copying the full tensor instead of just its data causes the issue.
Any thoughts?
PyTorch version: 1.9.0
[1] Sample code to reproduce the issue
import torch
import random


def log_gpu_mem(local_rank=0, step=0, unit="MB"):
    u2d = {"KB": 1024, "MB": 1024 ** 2}
    d = u2d[unit]
    total = torch.cuda.get_device_properties(local_rank).total_memory / d
    reserved = torch.cuda.memory_reserved(local_rank) / d
    allocated = torch.cuda.memory_allocated(local_rank) / d
    free = reserved - allocated
    s = f"[GPU Mem ({unit})] total: {total}, reserved = {reserved}, allocated: {allocated}, free: {free}"
    print(f"[P{local_rank}] " + s + f", step: {step}")


def run(steps, batch_sz, layer_sz, shm_data=True):
    model = torch.nn.Linear(layer_sz, layer_sz).to(0)
    optim = torch.optim.Adam(model.parameters())
    track = []  # (step, output) pairs waiting for a delayed backward pass
    shm = torch.zeros((batch_sz, layer_sz))
    shm.share_memory_()
    for step in range(1, steps + 1):
        inp = torch.rand((batch_sz, layer_sz)).to(0)
        out = model(inp)
        track.append((step, out))
        if shm_data:
            shm.copy_(out.data)  # copy only the tensor data (works as expected)
        else:
            shm.copy_(out)  # copy the full tensor (reproduces the issue)
        # Every few steps, run backward for the output produced `chosen`
        # steps ago, dropping any older outputs without a backward pass.
        if random.randint(1, 3) == 1:
            chosen = random.randint(1, 3)
            while True:
                s, out = track[0]
                if s < step - chosen:
                    del out  # too old, discard without backward
                    del track[0]
                    continue
                if s > step - chosen:
                    break
                grad = torch.rand(batch_sz, layer_sz).to(0)
                optim.zero_grad()
                out.backward(grad)
                optim.step()
                del out
                del track[0]
        if step % 25 == 0:
            log_gpu_mem(step=step)


if __name__ == "__main__":
    run(300, 1024, 1024, shm_data=True)
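To reproduce the issue, change the final call to run(300, 1024, 1024, shm_data=False).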