The demo code is listed below. Why, after I call torch.cuda.empty_cache(), can only part of the reserved memory generated by the forward pass of Net be released? In the second epoch, the reserved memory grows even larger while performing the same operations as in the first epoch. How does reserved memory work? It seems to grow in unexpected places.
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(3, 28, (3, 3))  # in_channels, out_channels, kernel_size
        self.maxpool1 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.maxpool2 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        x = self.conv(x)
        x = self.maxpool1(x)
        x = self.maxpool2(x)
        return x


def train():
    net = Net().to('cuda')
    for i in range(5):
        with torch.no_grad():
            frames_batches = torch.randn(512, 3, 224, 224).to('cuda')
            # before forward 1st epoch | memory_reserved: 296MB | memory_allocated: 294MB
            pred = net(frames_batches)
            # after forward 1st epoch | memory_reserved: 5014MB | memory_allocated: 465MB
            # after forward 2nd epoch | memory_reserved: 5688MB | memory_allocated: 465MB
            torch.cuda.empty_cache()
            # after empty_cache 1st epoch | memory_reserved: 1644MB | memory_allocated: 465MB


if __name__ == '__main__':
    train()
The caching allocator takes memory in blocks, and if a block isn't completely empty, it cannot be deallocated.
I can recommend torch.cuda.memory_summary() for looking at this in more detail.
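For reference, a minimal sketch of how you might instrument the loop from the demo above (log_memory is a hypothetical helper of my own, not a PyTorch API; net and frames_batches are assumed from the demo):

import torch

def log_memory(tag):
    # Hypothetical helper: print reserved vs. allocated memory in MB.
    mb = 1024 ** 2
    print(f"{tag} | reserved: {torch.cuda.memory_reserved() // mb} MB "
          f"| allocated: {torch.cuda.memory_allocated() // mb} MB")

log_memory('before forward')
pred = net(frames_batches)          # net / frames_batches as in the demo above
log_memory('after forward')
print(torch.cuda.memory_summary())  # per-pool breakdown, incl. non-releasable memory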
@tom Thank you for your timely reply!
I understand why the reserved space keeps growing. But starting from after forward 1st epoch | memory_reserved: 5014MB, why can't the cached memory be released automatically, so that I don't have to call torch.cuda.empty_cache() to clean it up? As far as I know, the caching allocator, much like Python reference counting, should automatically release cached space once there are no more references to a variable.
Besides, I have another question. The memory_summary after the first forward pass is listed below. Why is non-releasable memory generated during the forward pass of the net? Thanks!
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 476675 KB | 5012 MB | 5526 MB | 5061 MB |
| from large pool | 476672 KB | 5012 MB | 5526 MB | 5061 MB |
| from small pool | 3 KB | 0 MB | 0 MB | 0 MB |
|---------------------------------------------------------------------------|
| Active memory | 476675 KB | 5012 MB | 5526 MB | 5061 MB |
| from large pool | 476672 KB | 5012 MB | 5526 MB | 5061 MB |
| from small pool | 3 KB | 0 MB | 0 MB | 0 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 5014 MB | 5014 MB | 5014 MB | 0 B |
| from large pool | 5012 MB | 5012 MB | 5012 MB | 0 B |
| from small pool | 2 MB | 2 MB | 2 MB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 1178 MB | 1178 MB | 1521 MB | 351331 KB |
| from large pool | 1176 MB | 1176 MB | 1519 MB | 351232 KB |
| from small pool | 1 MB | 1 MB | 2 MB | 99 KB |
|---------------------------------------------------------------------------|
| Allocations | 4 | 17 | 160 | 156 |
| from large pool | 2 | 4 | 6 | 4 |
| from small pool | 2 | 16 | 154 | 152 |
|---------------------------------------------------------------------------|
| Active allocs | 4 | 17 | 160 | 156 |
| from large pool | 2 | 4 | 6 | 4 |
| from small pool | 2 | 16 | 154 | 152 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 5 | 5 | 5 | 0 |
| from large pool | 4 | 4 | 4 | 0 |
| from small pool | 1 | 1 | 1 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 2 | 3 | 30 | 28 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 1 | 3 | 29 | 28 |
|===========================================================================|
For efficiency reasons, tensors are not allocated individually but in blocks.
Non-releasable memory is the empty part of a memory block that also has some parts of it allocated.
Obviously frames_batches and pred are still on the GPU. Also, calling gc.collect() before empty_cache() helps Python let go of tensors that are no longer needed.
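A minimal sketch of that cleanup, assuming the frames_batches and pred variables from the demo above; dropping the Python references first is what lets the allocator actually return the now-empty blocks:

import gc
import torch

# Drop the last references so the caching allocator can free the blocks.
del frames_batches, pred
gc.collect()               # make sure Python isn't keeping the tensors alive
torch.cuda.empty_cache()   # return the now-empty cached blocks to the driver
print(f"reserved after cleanup: {torch.cuda.memory_reserved() / 1024**2:.0f} MB")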