nvidia-smi memory usage does not drop after empty_cache() although cuda.memory_summary() says there is no current usage

I am trying to run multiple processes that share the same GPU, and I need to release GPU memory so that another process can use it.
From the docs I understand that PyTorch does not return GPU memory to the driver right away, and that I need to call torch.cuda.empty_cache().

But the memory usage reported by nvidia-smi stays high even after empty_cache(), while torch.cuda.memory_summary() says that current usage is zero.

What am I missing?

import time

import torch
from torchvision import models

net = models.alexnet(pretrained=True)
net.cuda()                # move the parameters to the GPU
del net                   # drop the last reference so the tensors can be freed

torch.cuda.empty_cache()  # return cached blocks to the driver

print(f"{torch.cuda.memory_allocated()} {torch.cuda.max_memory_allocated()}")  # 0 244797440
print(f"{torch.cuda.memory_reserved()} {torch.cuda.max_memory_reserved()}")    # 0 257949696
print(torch.cuda.memory_summary())

time.sleep(100)           # keep the process alive so nvidia-smi can be checked

output of nvidia-smi

Thu Feb 10 06:45:21 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  On   | 00000000:AF:00.0 Off |                  N/A |
| 22%   50C    P2    73W / 250W |    476MiB / 12210MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     32247      C   python3                           471MiB |
+-----------------------------------------------------------------------------+

output of torch.cuda.memory_summary()

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |  239060 KB |  239060 KB |  239060 KB |
|       from large pool |       0 B  |  238928 KB |  238928 KB |  238928 KB |
|       from small pool |       0 B  |     132 KB |     132 KB |     132 KB |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |  239060 KB |  239060 KB |  239060 KB |
|       from large pool |       0 B  |  238928 KB |  238928 KB |  238928 KB |
|       from small pool |       0 B  |     132 KB |     132 KB |     132 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |  251904 KB |  251904 KB |  251904 KB |
|       from large pool |       0 B  |  249856 KB |  249856 KB |  249856 KB |
|       from small pool |       0 B  |    2048 KB |    2048 KB |    2048 KB |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |   21236 KB |   28613 KB |   28613 KB |
|       from large pool |       0 B  |   19280 KB |   26528 KB |   26528 KB |
|       from small pool |       0 B  |    2044 KB |    2085 KB |    2085 KB |
|---------------------------------------------------------------------------|
| Allocations           |       0    |      16    |      16    |      16    |
|       from large pool |       0    |       7    |       7    |       7    |
|       from small pool |       0    |       9    |       9    |       9    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |      16    |      16    |      16    |
|       from large pool |       0    |       7    |       7    |       7    |
|       from small pool |       0    |       9    |       9    |       9    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       5    |       5    |       5    |
|       from large pool |       0    |       4    |       4    |       4    |
|       from small pool |       0    |       1    |       1    |       1    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       4    |       4    |       4    |
|       from large pool |       0    |       2    |       2    |       2    |
|       from small pool |       0    |       2    |       2    |       2    |
|===========================================================================|

I am using torch 1.5.1 and torchvision 0.6.1 on an Ubuntu 18.04 machine.

The first CUDA call creates the CUDA context, which takes some memory on the GPU (depending on the compute capability of the device, the CUDA version, the number of loaded kernels, etc.). This memory is not reported by PyTorch but is visible via nvidia-smi.
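
For what it's worth, one way to see this overhead is to compare the driver's view of used memory with what the caching allocator has reserved. This is not from the original reply, and it relies on torch.cuda.mem_get_info(), which only exists in newer PyTorch releases (not in 1.5.1), so treat it as a sketch:

import torch

torch.cuda.init()                                  # force context creation
free, total = torch.cuda.mem_get_info()            # driver view: free/total bytes on the device
used_by_driver = total - free                      # everything nvidia-smi counts, across all processes
reserved_by_torch = torch.cuda.memory_reserved()   # what the caching allocator holds

# On a GPU used by a single process, the difference is roughly the CUDA
# context plus other non-allocator overhead that memory_summary() never reports.
print(f"driver used:         {used_by_driver / 2**20:.0f} MiB")
print(f"torch reserved:      {reserved_by_torch / 2**20:.0f} MiB")
print(f"overhead (~context): {(used_by_driver - reserved_by_torch) / 2**20:.0f} MiB")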

Thanks @ptrblck. That makes sense.

Is there a way to destroy the CUDA context so that GPU memory usage can drop to zero?

I have about 10 processes that share the same GPU, so if each process creates its own CUDA context that will add up to ~4-5 GiB, which might break the system.

No, you won’t be able to unload the context and would need to terminate the Python process.

Thanks. I will just terminate the process when necessary.
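
In case it helps anyone later, a minimal sketch of that pattern: do the GPU work in a spawned child process and let its exit tear down the CUDA context. The run_inference function and the dummy input below are placeholders for the real workload, not code from this thread.

import torch
import torch.multiprocessing as mp
from torchvision import models

def run_inference(queue):
    # All CUDA work happens inside this child process; its CUDA context is
    # destroyed when the process exits, so nvidia-smi drops back to zero.
    net = models.alexnet(pretrained=True).cuda().eval()
    x = torch.randn(1, 3, 224, 224, device="cuda")   # placeholder input
    with torch.no_grad():
        out = net(x)
    queue.put(out.argmax(dim=1).cpu())               # send results back on CPU

if __name__ == "__main__":
    ctx = mp.get_context("spawn")   # spawn avoids problems with CUDA and forked workers
    queue = ctx.Queue()
    p = ctx.Process(target=run_inference, args=(queue,))
    p.start()
    print(queue.get())              # read the result before joining
    p.join()                        # after the child exits, its GPU memory is fully released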