[Solved] Why does a CUDA float tensor with 64 million floats use ~512MB of GPU memory?

I do:

In [1]: import torch

In [2]: a = torch.rand(1024*1024*64)

In [3]: a_cuda = a.cuda()

In [4]: type(a_cuda)
Out[4]: torch.cuda.FloatTensor

Then I run nvidia-smi.

What I expect to see:

 [something something] 256MB

(i.e. 64 million * 4 bytes per float)
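
As a sanity check, the expected footprint can be computed from the tensor itself; a minimal sketch using the standard numel() and element_size() tensor methods:

import torch

a = torch.rand(1024 * 1024 * 64)               # 64M elements, float32 by default
print(a.numel() * a.element_size() / 1024**2)  # 256.0, i.e. the expected MB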

What I actually see:

Tue Aug 15 12:11:49 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.77                 Driver Version: 361.77                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   36C    P0    39W / 150W |    502MiB /  7618MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     52947    C   /mldata/conda/envs/pytorch/bin/python          500MiB |
+-----------------------------------------------------------------------------+

i.e. ~512MB.

Why is this?

Note, interestingly: if I create another 64-million-float tensor, the memory only increases by ~256MB, as I’d expect. I thought it might be a garbage-collection issue, but after calling:

gc.collect()
gc.collect()
gc.collect()

… the memory usage is still ~756MB:

In [1]: import torch

In [2]: a = torch.rand(1024*1024*64)

In [3]: a_cuda = a.cuda()

In [4]: type(a_cuda)
Out[4]: torch.cuda.FloatTensor

In [5]: a.size()
Out[5]: torch.Size([67108864])

In [6]: b = torch.rand(1024*1024*64)

In [7]: b_cuda = b.cuda()

In [8]: import gc

In [9]: gc.collect()
Out[9]: 87

In [10]: gc.collect()
Out[10]: 7

In [11]: gc.collect()
Out[11]: 7

nvidia-smi:

Tue Aug 15 12:16:56 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.77                 Driver Version: 361.77                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   38C    P0    40W / 150W |    758MiB /  7618MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     52947    C   /mldata/conda/envs/pytorch/bin/python          756MiB |
+-----------------------------------------------------------------------------+

AFAIK, PyTorch uses a caching allocator: even when memory is “free” from PyTorch’s point of view, it is not returned to the device, so nvidia-smi still reports it as in use by the process.
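
In later PyTorch releases this is directly observable: torch.cuda.memory_allocated() reports memory held by live tensors, torch.cuda.memory_reserved() reports what the caching allocator is holding on to, and torch.cuda.empty_cache() hands cached blocks back to the driver. A sketch, assuming a recent PyTorch where these functions exist (the transcript above predates them):

import torch

x = torch.rand(1024 * 1024 * 64, device="cuda")
print(torch.cuda.memory_allocated() / 1024**2)  # ~256 MB held by live tensors
print(torch.cuda.memory_reserved() / 1024**2)   # blocks held by the allocator

del x
# The tensor is gone, but the allocator keeps its block cached, so
# nvidia-smi still counts it against this process.
print(torch.cuda.memory_allocated() / 1024**2)  # ~0
print(torch.cuda.memory_reserved() / 1024**2)   # still ~256 MB

torch.cuda.empty_cache()                        # release cached blocks to the driver
print(torch.cuda.memory_reserved() / 1024**2)   # ~0 (context overhead remains)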


The 64 million float tensor only takes up 256 MB. The other ~250 MB is from all the CUDA kernels in libTHC.so and libTHCUNN.so. They’re loaded when CUDA is first initialized, which happened when you called a.cuda(). We have a lot of CUDA kernels since many are defined for every data type.
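
To see that fixed initialization cost in isolation, it is enough to create a trivially small CUDA tensor and then look at nvidia-smi; a sketch assuming nothing beyond torch itself, and the exact overhead will vary with GPU, driver, and PyTorch build:

import torch

# A one-element tensor is enough to trigger CUDA context creation and
# kernel loading; the tensor data itself is only 4 bytes.
tiny = torch.zeros(1).cuda()

# nvidia-smi will now show a few hundred MB for this process, almost all
# of it context + kernel images rather than tensor data.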


Double-precision floating-point format occupies 8 bytes, not 4, I thought.

Scratch that, I see you have float, not double :sweat_smile:
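
For reference, the per-element sizes are easy to check; a minimal sketch using element_size():

import torch

print(torch.rand(3).element_size())           # 4 bytes per element (float32)
print(torch.rand(3).double().element_size())  # 8 bytes per element (float64)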