Note: interestingly, if I create another 64-million-float tensor, memory only increases by ~256MB, as I'd expect. I thought it might be a garbage-collection issue, but even after calling gc.collect() three times, the memory usage is still ~756MB:
In [1]: import torch
In [2]: a = torch.rand(1024*1024*64)
In [3]: a_cuda = a.cuda()
In [4]: type(a_cuda)
Out[4]: torch.cuda.FloatTensor
In [5]: a.size()
Out[5]: torch.Size([67108864])
In [6]: b = torch.rand(1024*1024*64)
In [7]: b_cuda = b.cuda()
In [8]: import gc
In [9]: gc.collect()
Out[9]: 87
In [10]: gc.collect()
Out[10]: 7
In [11]: gc.collect()
Out[11]: 7
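A quick sanity check of the arithmetic (each element is a 4-byte float32), just to confirm the expected per-tensor footprint:

n = 1024 * 1024 * 64          # 64M elements, as in the transcript above
expected_mib = n * 4 / (1024 ** 2)
print(expected_mib)           # 256.0 -- matches the ~256MB jump per tensor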
nvidia-smi:
Tue Aug 15 12:16:56 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.77     Driver Version: 361.77                                |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   38C    P0    40W / 150W |    758MiB /  7618MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     52947    C   /mldata/conda/envs/pytorch/bin/python           756MiB |
+-----------------------------------------------------------------------------+
Each 64-million-float tensor only takes up 256 MB (64M elements × 4 bytes). The remaining ~250 MB is from all the CUDA kernels in libTHC.so and libTHCUNN.so. They're loaded when CUDA is first initialized, which happened on the first a.cuda() call. We have a lot of CUDA kernels, since many are compiled for every data type.
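If you want to see the split between tensor storage and context/kernel overhead directly, newer PyTorch versions (1.10+) expose cudaMemGetInfo as torch.cuda.mem_get_info. A rough sketch (this API did not exist in the 2017-era PyTorch shown above, and the numbers assume nothing else is using the GPU):

import torch

torch.cuda.init()  # force CUDA initialization: context creation + kernel load
free_before, total = torch.cuda.mem_get_info(0)  # driver-level free/total bytes

a = torch.rand(1024 * 1024 * 64, device="cuda")  # one 256 MiB float32 tensor
torch.cuda.synchronize()
free_after, _ = torch.cuda.mem_get_info(0)

# ~256 MiB: the tensor itself (the caching allocator requests it via cudaMalloc)
print((free_before - free_after) / 2**20)
# What the process grabbed before any tensor existed: CUDA context + kernels.
# (If other processes hold GPU memory, their usage is included here too.)
print((total - free_before) / 2**20)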