Potential Memory Leakage

Hello, I am facing a potential GPU memory leakage problem. I made a simple test to measure the memory:

def train(data):
    model.train()
    for batch in train_loader:
        print("before: %.2f MB" % (torch.cuda.memory_allocated() / 1024 / 1024), flush=True)
        batch = batch.to(device)
        optimizer.zero_grad()
        out = model(batch.x)
        loss = F.nll_loss(out, batch.y)
        print("middle: %.2f MB" % (torch.cuda.memory_allocated() / 1024 / 1024), flush=True)
        loss.backward()
        optimizer.step()
        print("after: %.2f MB" % (torch.cuda.memory_allocated() / 1024 / 1024), flush=True)
    return float(loss)

And in my model's forward, I also added a memory measurement just before the return:

def forward(self, x):
    # ... (omitted) ...
    print("allocated: %.2f MB" % (torch.cuda.memory_allocated() / 1024 / 1024), flush=True)
    return out

Then the output seems weird. For the first several batches:

before: 0.00 MB
allocated: 918.37 MB
middle: 900.69 MB
after: 46.32 MB
before: 46.32 MB
allocated: 1496.60 MB
middle: 1468.49 MB
after: 109.55 MB
before: 109.55 MB
allocated: 571.27 MB
middle: 562.04 MB
after: 129.10 MB

But after several batches, the values increase beyond what they should be and then become stable:

before: 7077.44 MB
allocated: 7951.10 MB
middle: 7933.39 MB
after: 7077.51 MB
before: 7077.51 MB
allocated: 8597.35 MB
middle: 8566.14 MB
after: 7077.21 MB
before: 7077.21 MB
allocated: 8363.79 MB
middle: 8337.63 MB
after: 7077.20 MB

Is there a memory leak in my code? The gap between allocated and middle is small; is that normal? Any help would be appreciated.

Are you storing any tensors which might still be attached to the computation graph outside of the train method?
I would assume float(loss) would detach the tensor, so I'm unsure what might be causing it.
In case you get stuck, could you post a minimal, executable code snippet that reproduces the issue, please?
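
For example, accumulating the raw loss tensor somewhere (a generic sketch, not your code; all names here are made up) keeps every iteration's graph alive, while float(loss) does not:

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(2048, 2048).to(device)

total = 0.0
for step in range(5):
    x = torch.randn(1024, 2048, device=device)
    loss = model(x).pow(2).mean()
    total = total + loss            # tensor sum: keeps every iteration's graph alive
    # total = total + float(loss)   # plain Python float: the graph can be freed
    print("allocated: %.2f MB" % (torch.cuda.memory_allocated() / 1024 / 1024), flush=True)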

Thanks for your reply and sorry for the late response. I have found the problem. When I used the dataloader to get mini-batches, I previously did it like this:

for batch in train_loader:
    batch = batch.to(device)

then I switched to:

for batch in train_loader:
    x = batch.x.to(device)
    y = batch.y.to(device)

It works. I found that the second way releases the memory, while the first one does not. I am not sure of the actual reason for that.
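
For context, folding that change into the loop from my first post gives roughly this (same model, optimizer, and train_loader as above; only the device transfer differs):

def train(data):
    model.train()
    for batch in train_loader:
        x = batch.x.to(device)   # move only the tensors this step needs
        y = batch.y.to(device)
        optimizer.zero_grad()
        out = model(x)
        loss = F.nll_loss(out, y)
        loss.backward()
        optimizer.step()
    return float(loss)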

That’s interesting to see. What is your DataLoader returning? It seems to be an object containing internal .x and .y attributes which also provides the to() operation, so it cannot be a plain list.

Correct, it’s a Data object containing each batch’s properties, like .x and .y, etc. It confuses me why this would cause the problem.
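
One way to narrow this down is to check what still holds GPU memory at the end of each iteration. The sketch below uses a hypothetical FakeData stand-in (not the real Data class) just to illustrate the general point: as long as any reference to a GPU-resident batch object survives, all of its attribute tensors stay allocated, whereas moving only .x and .y copies just those two tensors.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

class FakeData:
    # hypothetical stand-in for a Data-like batch (.x, .y, .to()); not the real class
    def __init__(self, x, y):
        self.x, self.y = x, y

    def to(self, device):
        # return a moved copy so the CPU original is left untouched
        return FakeData(self.x.to(device), self.y.to(device))

batches = [FakeData(torch.randn(8192, 1024), torch.randint(0, 10, (8192,)))
           for _ in range(3)]

for batch in batches:
    batch = batch.to(device)  # every attribute of the batch now lives on the GPU
    # ... a training step would go here ...
    print("end of iteration: %.2f MB" % (torch.cuda.memory_allocated() / 1024 / 1024), flush=True)
    # any surviving reference to this GPU-resident object (the loop variable,
    # something cached on the model, a stored output, ...) keeps all of its
    # tensors allocated until that reference goes away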