Confusion about a tensor's memory usage

I recently noticed an interesting scenario that led me to this question. I have a float tensor containing 3 elements. Upon moving this tensor to the GPU, nvidia-smi reports 1089MiB of memory consumption. Please see the IPython session below for more details:

$ ipython
Python 3.9.13 (main, Aug 25 2022, 23:26:10) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch, sys

In [2]: torch.__version__
Out[2]: '1.9.0+cu102'

In [3]: torch.cuda.is_available()
Out[3]: True

In [4]: a = torch.ones((1, 3))

In [5]: a.to("cuda:0")
Out[5]: tensor([[1., 1., 1.]], device='cuda:0')

In [6]: a
Out[6]: tensor([[1., 1., 1.]])

In [7]: a.dtype
Out[7]: torch.float32

In [8]: # Source: https://discuss.pytorch.org/t/how-to-know-the-memory-allocated-for-a-tensor-on-gpu/28537/2

In [9]: a.element_size() * a.nelement()
Out[9]: 12

In [10]: # Source: https://stackoverflow.com/a/54365012

In [11]: sys.getsizeof(a.storage())
Out[11]: 68

Theoretically, a plain C array of 3 floats should take 4 bytes per float * 3 floats = 12 bytes of memory. However, I understand that a tensor carries extra metadata, so it will consume somewhat more memory than the plain C array.

Anyway, based on the two references above ([1] and [2]), I computed the size of the allocated memory in both ways, but as you have already noticed, the results are not equal, i.e. 12 != 68.
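As far as I can tell, the gap comes from sys.getsizeof() measuring the Python storage object itself, i.e. the 12 bytes of data plus the wrapper object's own overhead, so element_size() * nelement() is the number that reflects the raw data. A minimal sketch putting the two measurements side by side, assuming the same float32 tensor as above:

import sys
import torch

a = torch.ones((1, 3), dtype=torch.float32)

# Raw data buffer only: 3 elements * 4 bytes per float32 = 12 bytes.
data_bytes = a.element_size() * a.nelement()

# Python-level size of the storage object: the 12 data bytes plus the
# wrapper's own object overhead, hence the larger number (68 above).
wrapper_bytes = sys.getsizeof(a.storage())

print(data_bytes, wrapper_bytes)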

On the other hand, when I moved this tensor to the GPU, nvidia-smi reported 1089MiB of memory consumption. Below is the output of the nvidia-smi command:

$ nvidia-smi 
Tue Oct  4 16:08:54 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   50C    P8    13W /  N/A |   2513MiB /  7982MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1554      G   /usr/lib/xorg/Xorg                160MiB |
|    0   N/A  N/A      2820      G   /usr/lib/xorg/Xorg                665MiB |
|    0   N/A  N/A      3001      G   /usr/bin/gnome-shell              105MiB |
|    0   N/A  N/A      3614      G   ...763400436228628087,131072      397MiB |
|    0   N/A  N/A     39131      G   ...RendererForSitePerProcess       78MiB |
|    0   N/A  N/A    141097      C   ...conda/envs/ray/bin/python     1089MiB |
+-----------------------------------------------------------------------------+

GPU memory is more precious than anything else in the world!!! This is why I can’t digest the memory usage reported by nvidia-smi.

What’s wrong here?

The initialization creates the CUDA context, which loads all kernels for your GPU architecture, so this is expected. The size of the context depends on the CUDA version, your GPU, the number of kernels in the loaded CUDA libraries, as well as on the native PyTorch kernels.
You could update to CUDA 11.7 and enable lazy module loading via CUDA_MODULE_LOADING=LAZY, which loads kernels only when they are needed and thus reduces the context size.
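For reference, a minimal sketch of how the variable could be set from Python, assuming a CUDA 11.7+ build of PyTorch (it has to be in place before the CUDA context is created, so set it before the first CUDA call):

import os

# Must be set before the CUDA context is created (i.e. before the first CUDA
# call in this process); it only takes effect with CUDA 11.7+ builds.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

import torch

x = torch.ones((1, 3)).to("cuda:0")  # first CUDA call creates the (now smaller) context
# Compare the per-process number reported by nvidia-smi with and without the variable.

Equivalently, the variable can be exported in the shell before launching Python (e.g. CUDA_MODULE_LOADING=LAZY python your_script.py).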

Thank you @ptrblck for your wonderful response.

I must say, the CUDA context (kernels etc.) is really heavy, isn’t it?

I did the same analysis on PyTorch version 1.12.0+cu102 and found that the memory usage dropped to 867MiB.

Finally, thank you very much for telling me about the CUDA_MODULE_LOADING=LAZY environment variable.

Yes, it is, which is why I’m really happy to see the lazy module loading utility, which reduces the CUDA context size significantly by loading only the needed kernels. This utility was, by the way, further improved in CUDA 11.8, which was released today and which we should hopefully support soon.

BTW, just to make sure we are on the same page: I did not set CUDA_MODULE_LOADING=LAZY when I reported the 867MiB memory usage, because those statistics were also recorded with a CUDA 10.2 build (PyTorch version 1.12.0+cu102).

In summary, loading the CUDA context (along with a tiny tensor) took 1089MiB with PyTorch version 1.9.0+cu102 and 867MiB with PyTorch version 1.12.0+cu102.
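Incidentally, in case it helps separate the tiny tensor's own allocation from the context in comparisons like this: torch.cuda.memory_allocated() only counts memory held by tensors through PyTorch's caching allocator (which, as far as I know, rounds small allocations up to 512-byte blocks), whereas nvidia-smi shows the whole process footprint including the context. A minimal sketch; the exact numbers will vary by setup:

import torch

a = torch.ones((1, 3)).to("cuda:0")  # 12 bytes of tensor data

# Memory held by tensors via the caching allocator; the 12-byte tensor is
# rounded up to one small block, so expect a few hundred bytes, not MiB.
print(torch.cuda.memory_allocated("cuda:0"))

# Memory reserved by the caching allocator from the driver; still excludes
# the CUDA context, which is the bulk of what nvidia-smi shows.
print(torch.cuda.memory_reserved("cuda:0"))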

Nevertheless, I will try the same with CUDA 11 later on.

Thank you again.

Yes, the env variable takes effect with CUDA 11.7+ and won’t change anything with CUDA 10.2. The context size reduction between PyTorch 1.9.0 and 1.12.0 comes from the framework itself loading some modules lazily, as well as from a reduction in the number of kernels.
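For completeness, a quick sketch to check which CUDA version a given PyTorch build was compiled against before relying on the variable (11.7 or newer is what lazy loading needs):

import torch

print(torch.__version__)   # e.g. 1.12.0+cu102
print(torch.version.cuda)  # '10.2' here, so the variable would not help yet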

Thanks @ptrblck for the complete answer.