Is there any way to make PyTorch reserve less GPU memory? I've found that it reserves GPU memory very aggressively even for simple computations, which causes CUDA OOM errors for larger ones. Here is a code snippet:
In [1]: import torch as tc
In [2]: m = tc.nn.Sequential(
...: tc.nn.Linear(1000, 1000),
...: tc.nn.Linear(1000, 1000),
...: tc.nn.Linear(1000, 1000),
...: tc.nn.Linear(1000, 1),
...: ).cuda()
In [3]: x = tc.randn(1000, 1000).cuda()
In [4]: m(x).sum().backward() # a forward pass followed by a backward pass
This results in 1285MiB of GPU memory usage according to nvidia-smi:
| 0 N/A N/A 31758 C ...-3.7.6/bin/python3.7 1285MiB |
But only a few tensors are actually allocated on the GPU:
In [10]: import gc
In [11]: [obj.shape for obj in gc.get_objects() if tc.is_tensor(obj) and obj.device.type != 'cpu']
Out[11]:
[torch.Size([1000, 1000]),
torch.Size([1000, 1000]),
torch.Size([1000]),
torch.Size([1000, 1000]),
torch.Size([1000]),
torch.Size([1000, 1000]),
torch.Size([1000]),
torch.Size([1, 1000]),
torch.Size([1])]
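For reference, summing the storage of those tensors shows they account for only ~15MiB, nowhere near the 1285MiB above (a quick sketch of the same enumeration):

import gc
import torch as tc

# Sum the storage of every CUDA tensor the GC can see:
cuda_tensors = [obj for obj in gc.get_objects()
                if tc.is_tensor(obj) and obj.device.type != 'cpu']
total = sum(t.numel() * t.element_size() for t in cuda_tensors)
print(f"{total / 1024**2:.1f} MiB")  # roughly 15 MiB for the shapes listed above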
I don't want to call torch.cuda.empty_cache(), since it is very expensive inside a training loop.
Please advise, thank you!
The rest of the allocation is caused by the CUDA context, which loads all native PyTorch CUDA kernels as well as kernels from libraries such as cuDNN, cuBLAS, etc.
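You can see this overhead in isolation by initializing the CUDA context without allocating anything and comparing PyTorch's own accounting against nvidia-smi (a minimal sketch):

import torch as tc

tc.cuda.init()                     # force creation of the CUDA context
print(tc.cuda.memory_allocated())  # 0 -- PyTorch's allocator holds nothing yet
# nvidia-smi will nevertheless show several hundred MiB for this process;
# that gap is the CUDA context (driver state plus loaded kernels).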
Thank you for your reply!
Are you suggesting the rest of the allocation comes from CUDA libraries (which should then take a constant amount of space)? That doesn't seem to be the case when I increase the input size:
In [1]: import torch as tc
In [2]: m = tc.nn.Sequential(
   ...:     tc.nn.Linear(1000, 1000),
   ...:     tc.nn.Linear(1000, 1000),
   ...:     tc.nn.Linear(1000, 1000),
   ...:     tc.nn.Linear(1000, 1),
   ...: ).cuda()
In [3]: x = tc.randn(100000, 1000).cuda() # 400MB
In [7]: m(x).sum().backward()
In [8]: tc.cuda.memory_allocated() # 1600MB, this number makes sense to me
Out[8]: 1627630080
In [9]: tc.cuda.memory_reserved() # why double the above size?
Out[9]: 3338665984
And after these calls, nvidia-smi shows an even bigger memory footprint.
IIUC, four 400MB CUDA tensors should be created (one for the input, three for the intermediate results right after each of the first three linear layers, saved for the gradient computation), and everything else should be negligible. So the total CUDA memory footprint should be about 1.6GB. But tc.cuda.memory_reserved() shows roughly twice that, and nvidia-smi reports even more.
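As a quick sanity check of that estimate (assuming float32, i.e. 4 bytes per element):

batch, features = 100_000, 1_000
bytes_per_tensor = batch * features * 4  # 400 MB per input/activation tensor
expected = 4 * bytes_per_tensor          # input + 3 saved intermediate outputs
print(expected)                          # 1_600_000_000, close to the 1627630080
                                         # bytes reported by memory_allocated()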
memory_reserved() shows the allocated plus cached memory, while nvidia-smi shows the overall memory usage of the process (including the CUDA context). You could use memory_summary() again to check the individual allocations. If no other processes are running, please share an executable code snippet to reproduce the memory increase seen via nvidia-smi.
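For example, a minimal way to inspect the allocator state (both calls are part of the torch.cuda API):

import torch as tc

# Human-readable breakdown of allocated vs. reserved (cached) memory:
print(tc.cuda.memory_summary())

# The same counters are also available programmatically:
stats = tc.cuda.memory_stats()
print(stats["allocated_bytes.all.current"])
print(stats["reserved_bytes.all.current"])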
Do you mean something like this? After running this code, the CUDA memory footprint seen in nvidia-smi grows to ~3.2GB:
import torch as tc

m = tc.nn.Sequential(
    tc.nn.Linear(1000, 1000),
    tc.nn.Linear(1000, 1000),
    tc.nn.Linear(1000, 1000),
    tc.nn.Linear(1000, 1),
).cuda()
x = tc.randn(100000, 1000).cuda()  # 400MB
m(x).sum().backward()
+-------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================|
| 0 N/A N/A 79663 C ...n-3.7.6/bin/python3.7 3281MiB |
+-------------------------------------------------------------------------+
Thanks!
As you can see in the memory_summary() output, PyTorch reserves ~2GB, so given the model size + CUDA context + the PyTorch cache, the memory usage is expected:
| GPU reserved memory | 2038 MB | 2038 MB | 2038 MB | 0 B |
| from large pool | 2036 MB | 2036 MB | 2036 MB | 0 B |
| from small pool | 2 MB | 2 MB | 2 MB | 0 B |
If you want to release the cache, use torch.cuda.empty_cache(). This will synchronize your code and thus slow it down, but it would allow other applications to use this memory.
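If the cache growth itself is the problem, depending on your PyTorch version you could also tune the caching allocator via the PYTORCH_CUDA_ALLOC_CONF environment variable or cap PyTorch's share of the device memory; a sketch (check that your version supports these knobs):

import os
# Must be set before the first CUDA allocation in the process; limits how
# large cached blocks may be split, which can reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch as tc

# Cap the allocator at a fraction of total device memory (PyTorch >= 1.8);
# allocations beyond this raise an OOM error instead of growing the cache.
tc.cuda.set_per_process_memory_fraction(0.8, device=0)

# Releasing the cache synchronizes and is slow, so call it sparingly,
# e.g. between training and evaluation phases rather than per iteration:
tc.cuda.empty_cache()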