Problem
Summary: A tensor object that is no longer referenced still occupies GPU memory.
I monitor the GPU memory usage through nvidia-smi.
Code:
import torch
import torch.nn as nn

# Note: `_D` and `V` are not defined in this snippet.

class FCBlock(nn.Module):
    def __init__(self, in_channel, hidden_channel, out_channel, n_blocks):
        super().__init__()
        self.net = nn.Sequential()
        self.net.append(nn.Linear(in_channel, hidden_channel))
        self.net.append(nn.ReLU())
        for _ in range(n_blocks):
            self.net.append(nn.Linear(hidden_channel, hidden_channel))
            self.net.append(nn.ReLU())
        self.net.append(nn.Linear(hidden_channel, out_channel))

    def forward(self, x):
        return self.net(x)

def T_forward(surf, y, direction):
    F, g = F_forward(surf, y)
    # Normalize the gradient, guarding against division by zero
    g = g / g.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    return y + g * _D(y) * direction

def F_forward(surf, x):
    x.requires_grad = True
    F = surf(x)
    # Gradient of F w.r.t. x, keeping the graph for higher-order derivatives
    Fx = torch.autograd.grad(F, [x], grad_outputs=torch.ones_like(F),
                             retain_graph=True, create_graph=True)[0]
    return F, Fx

def test_leak(surf):
    y = torch.randn((8192, 100, 3)).to(V().cfg.device)
    direction = torch.ones((8192, )).view(8192, 1, 1).to(V().cfg.device)
    T_forward(surf, y, direction)  # consumes 20GB GPU memory, which stays unreleased
    T_forward(surf, y, direction)  # consumes another 20GB GPU memory, which stays unreleased

if __name__ == "__main__":
    surf = FCBlock(3, 512, 1, 2)
    test_leak(surf)
    print('hi')
    # Keep the process alive so the memory usage can be checked in nvidia-smi
    while True:
        pass
Phenomenon:
(1) The first T_forward call consumes 20GB of GPU memory, which stays unreleased.
(2) The second T_forward call consumes another 20GB of GPU memory, which also stays unreleased.
(3) Neither allocation is released even after test_leak returns.
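One caveat about measuring with nvidia-smi: it reports memory held by the whole process, including blocks that PyTorch's caching allocator has reserved but is not currently using for any tensor. A minimal sketch for separating the two numbers from inside the process is below. The helper names cuda_memory_mib and release_cached_memory are hypothetical (not part of my code above), and the sketch falls back to zeros when no GPU is present. It only measures; it does not by itself explain where the memory goes.

```python
import gc
import torch

def cuda_memory_mib():
    """Return (allocated, reserved) CUDA memory in MiB, or (0.0, 0.0) without a GPU."""
    if not torch.cuda.is_available():
        return 0.0, 0.0
    mib = 1024 ** 2
    # allocated: memory occupied by live tensors
    # reserved: memory held by the caching allocator (live tensors plus cached blocks)
    return torch.cuda.memory_allocated() / mib, torch.cuda.memory_reserved() / mib

def release_cached_memory():
    """Run Python's cyclic gc, then hand unused cached blocks back to the driver."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return cuda_memory_mib()

# Hypothetical usage: allocate a tensor, drop it, and compare the readings.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
print("after allocation:", cuda_memory_mib())
del x
print("after del:", cuda_memory_mib())
print("after gc.collect + empty_cache:", release_cached_memory())
```

If the allocated figure drops after del but the reserved figure does not, the memory is cached rather than leaked. If the allocated figure itself stays high, something is still holding a reference to the tensors.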
Analysis:
As I understand it, Python's gc automatically collects objects with zero references. But gc does not seem to be working in this code: after test_leak returns, no tensor object is referenced, so all GPU tensors should be released. Yet nvidia-smi tells me they are not.
Why?
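To make the assumption about gc concrete, here is a small pure-Python sketch of the two collection paths in CPython: reference counting frees an object as soon as its count reaches zero, while objects that reference each other in a cycle survive until the cyclic collector runs. The Node class and helper functions are hypothetical illustrations. Whether a reference cycle is actually involved in my code above is not established here.

```python
import gc
import weakref

class Node:
    """Toy object that can point at another Node."""
    def __init__(self, name):
        self.name = name
        self.other = None

def make_plain():
    node = Node("plain")
    # Only a weak reference escapes, so the refcount hits zero on return.
    return weakref.ref(node)

def make_cycle():
    a, b = Node("a"), Node("b")
    a.other, b.other = b, a  # a and b keep each other alive
    return weakref.ref(a)

gc.disable()  # stop the automatic cyclic collector from running mid-demo
plain_ref = make_plain()
cycle_ref = make_cycle()

plain_freed_immediately = plain_ref() is None
cycle_alive_before_collect = cycle_ref() is not None

gc.collect()  # explicitly run the cyclic collector
cycle_freed_after_collect = cycle_ref() is None
gc.enable()

print(plain_freed_immediately, cycle_alive_before_collect, cycle_freed_after_collect)
```

On CPython all three flags are expected to be True. So "zero references" only guarantees immediate release when no cycle exists; otherwise the object lingers until a gc pass, which can be forced with gc.collect().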
Environment
torch 2.0.1+cu118
Ubuntu 22.04