Why can't the reserved memory generated by the forward pass of Net be released?

:question: Questions and Help

The demo code is listed below. After I call torch.cuda.empty_cache(), why is only part of the reserved space generated by the forward pass of Net released? In the second epoch, the reserved space grows even though the same operations as the first epoch are performed. And how does reserved memory work? It always increases in places I don't expect.

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = torch.nn.Conv2d(3, 28, (3, 3)) # in_channels, out_channels, kernel_size
        self.maxpool1 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.maxpool2 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
    def forward(self, x):
        x = self.conv(x)
        x = self.maxpool1(x)
        x = self.maxpool2(x)
        return x

def train():
    net = Net().to('cuda')
    for i in range(5):
        with torch.no_grad():
            frames_batches = torch.randn(512, 3, 224, 224).to('cuda')
            
            # before forward 1st | memory_reserved: 296MB | memory allocated: 294MB
            pred = net(frames_batches)
            # after forward 1st epoch | memory_reserved: 5014MB | memory allocated: 465MB
            # after forward 2nd epoch | memory_reserved: 5688MB | memory allocated: 465MB
            torch.cuda.empty_cache()
            # after empty_cache 1st epoch | memory_reserved: 1644MB | memory allocated: 465MB
if __name__ == '__main__':
    train()
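
The memory figures in the comments can be read with the standard torch.cuda counters, roughly like this (a minimal sketch; print_mem is a hypothetical helper name, and both counters return bytes, hence the conversion to MB):

def print_mem(tag):
    reserved = torch.cuda.memory_reserved() // 2**20
    allocated = torch.cuda.memory_allocated() // 2**20
    print(f'{tag} | memory_reserved: {reserved}MB | memory allocated: {allocated}MB')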

The caching allocator takes memory in blocks, and if a block isn't completely empty, it cannot be deallocated.
I can recommend torch.cuda.memory_summary() for looking at this in more detail.
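
For example, a minimal sketch of the two counters and what empty_cache() can actually return, assuming a fresh process so the counters start near zero:

import torch

x = torch.randn(1024, 1024, device='cuda')  # ~4MB tensor, served from a cached block
print(torch.cuda.memory_allocated())        # bytes held by live tensors
print(torch.cuda.memory_reserved())         # bytes the allocator obtained via cudaMalloc
del x                                       # the block is now completely empty
torch.cuda.empty_cache()                    # so it can be returned to the driver
print(torch.cuda.memory_reserved())         # drops back down
print(torch.cuda.memory_summary())          # per-pool breakdown, incl. non-releasable memory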

Best regards

Thomas

@tom Thank you for your quick reply!
I understand why the reserved space keeps growing. But given after forward 1st epoch | memory_reserved: 5014MB, why can't the cached memory be released automatically, so that I have to call torch.cuda.empty_cache() to clean it up? As I understood it, the caching allocator works like Python reference counting and automatically releases cached space once nothing references the variable anymore.

Besides, I have another question. The memory_summary after the first forward pass is listed below. Why is Non-releasable memory generated during the forward pass of the net? Thanks!

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  476675 KB |    5012 MB |    5526 MB |    5061 MB |
|       from large pool |  476672 KB |    5012 MB |    5526 MB |    5061 MB |
|       from small pool |       3 KB |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| Active memory         |  476675 KB |    5012 MB |    5526 MB |    5061 MB |
|       from large pool |  476672 KB |    5012 MB |    5526 MB |    5061 MB |
|       from small pool |       3 KB |       0 MB |       0 MB |       0 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    5014 MB |    5014 MB |    5014 MB |       0 B  |
|       from large pool |    5012 MB |    5012 MB |    5012 MB |       0 B  |
|       from small pool |       2 MB |       2 MB |       2 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |    1178 MB |    1178 MB |    1521 MB |  351331 KB |
|       from large pool |    1176 MB |    1176 MB |    1519 MB |  351232 KB |
|       from small pool |       1 MB |       1 MB |       2 MB |      99 KB |
|---------------------------------------------------------------------------|
| Allocations           |       4    |      17    |     160    |     156    |
|       from large pool |       2    |       4    |       6    |       4    |
|       from small pool |       2    |      16    |     154    |     152    |
|---------------------------------------------------------------------------|
| Active allocs         |       4    |      17    |     160    |     156    |
|       from large pool |       2    |       4    |       6    |       4    |
|       from small pool |       2    |      16    |     154    |     152    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       5    |       5    |       5    |       0    |
|       from large pool |       4    |       4    |       4    |       0    |
|       from small pool |       1    |       1    |       1    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       2    |       3    |      30    |      28    |
|       from large pool |       1    |       1    |       1    |       0    |
|       from small pool |       1    |       3    |      29    |      28    |
|===========================================================================|

For efficiency reasons, tensors are not allocated individually but in blocks.
Non-releasable memory is the empty part of a memory block that also has some parts of it allocated.
Obviously, frames_batches and pred are still on the GPU. Also, calling gc.collect() before empty_cache() helps ensure Python isn't holding on to tensors that are no longer needed.
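
Something along these lines at the end of the loop body, a minimal sketch (dropping the references makes the blocks fully empty, so empty_cache() can return them):

import gc
import torch

del frames_batches, pred   # drop the last references to the big tensors
gc.collect()               # collect anything (e.g. reference cycles) still holding them
torch.cuda.empty_cache()   # fully-empty cached blocks go back to the driver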

Best regards

Thomas

@tom Thanks! Now I understand why this happens :smiley: