I ran an experiment with the code below:
import torch

cpu, gpu = torch.device('cpu'), torch.device('cuda:0')
size = (1, 1, 1)
t_gpu1 = torch.zeros(size, device=gpu)  # first CUDA op initializes the CUDA context
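The memory growth can also be watched from inside the process. A small sketch (Linux-only; `rss_kb` is a hypothetical helper, and the torch lines mirror the snippet above but are commented out so the helper runs stand-alone):

```python
# Sketch (Linux-only): read this process's VmRSS from /proc/self/status
# before and after CUDA initialization, to see the growth pmap reports.

def rss_kb():
    """Current VmRSS of this process, in kB."""
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])  # value is reported in kB
    raise RuntimeError('VmRSS not found')

before = rss_kb()
# import torch
# t_gpu1 = torch.zeros((1, 1, 1), device=torch.device('cuda:0'))
after = rss_kb()
print(f'RSS grew by {after - before} kB')
```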
With the command pmap -x <pid>, I see that there is a large anonymous mapping (about 1 GB). It is non-file-backed memory, and most of it is resident (RSS):
000055d6b008d000 1172860 1171972 1171972 rw--- [ anon ]
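pmap gets these numbers from /proc/<pid>/maps. A minimal Python sketch (Linux-only; `largest_anon_kb` is an illustrative helper, run here on the current process rather than the python process above) that finds the largest anonymous writable mapping:

```python
# Sketch (Linux-only): find the largest anonymous rw mapping of the
# current process by parsing /proc/self/maps, the data behind pmap.

def largest_anon_kb():
    best = 0
    with open('/proc/self/maps') as f:
        for line in f:
            fields = line.split()
            # anonymous mappings have no pathname field at the end
            if len(fields) == 5 and 'rw' in fields[1]:
                start, end = (int(x, 16) for x in fields[0].split('-'))
                best = max(best, (end - start) // 1024)
    return best

print(f'largest anonymous rw mapping: {largest_anon_kb()} kB')
```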
What’s more, the output of cat /proc/<pid>/smaps suggests this large region is private, which means it cannot be shared between multiple processes. It is also dirty, which means it would have to be written out to the swap area so it can be recovered when necessary (though I don’t know why it is dirty):
55d6b008d000-55d6f79ec000 rw-p 00000000 00:00 0 [heap]
Size: 1172860 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 1171972 kB
Pss: 1171972 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 1171972 kB
Referenced: 1171336 kB
Anonymous: 1171972 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 0
VmFlags: rd wr mr mw me ac sd
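To check how much of the whole process's dirty memory this one region accounts for, one can sum a field over a full smaps dump. A small sketch (`total_field_kb` is an illustrative helper; field names follow the kernel's smaps format):

```python
def total_field_kb(smaps_text, field='Private_Dirty'):
    """Sum a per-mapping smaps field (values are reported in kB)."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith(field + ':'):
            total += int(line.split()[1])
    return total

# Tiny excerpt standing in for a full /proc/<pid>/smaps dump:
sample = (
    'Private_Dirty:  1171972 kB\n'
    'Private_Dirty:      128 kB\n'
)
print(total_field_kb(sample))  # -> 1172100
```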
Finally, I used cuda-gdb to attach to the python process and dump the memory at this address (the region is 1172860 kB, hence the *1024):
(cuda-gdb) dump binary memory pytorch-mem.bin 0x000055d6b008d000 0x000055d6b008d000+1172860*1024
(cuda-gdb) shell cat pytorch-mem.bin | strings > pytorch-mem.bin.strings
I see lots of kernel code symbols in this memory area. But why are most of them loaded into memory when such simple code runs? And since this memory is private, which means it cannot be shared across processes, will each process consume this much memory when multiple processes run at the same time?
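The strings-extraction step can be reproduced in pure Python as well. A sketch with a tiny stand-in buffer (`extract_strings` is an illustrative helper; the real input would be the bytes of pytorch-mem.bin):

```python
import re

def extract_strings(data, min_len=4):
    """Return runs of printable ASCII at least min_len bytes long,
    mimicking the default behaviour of the strings(1) tool."""
    pattern = rb'[\x20-\x7e]{%d,}' % min_len
    return [m.group().decode('ascii') for m in re.finditer(pattern, data)]

demo = b'\x00\x01kernel_symbol\x00abc\x00another_sym\x7f\x00'
print(extract_strings(demo))  # -> ['kernel_symbol', 'another_sym']
```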
I also checked the size of libtorch.so: it is actually about 1.2 GB, and it should contain both the CPU code and the GPU (kernel) code. Below is the file-backed mapping of libtorch.so:
7f7c896f7000-7f7ccced2000 r-xp 00000000 08:02 38546131 /usr/share/miniconda3/envs/venv/lib/python3.7/site-packages/torch/lib/libtorch.so
Size: 1105772 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 304636 kB
Pss: 304636 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 304636 kB
Private_Dirty: 0 kB
Referenced: 304636 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 0
VmFlags: rd ex mr mw me sd
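Note that this file-backed mapping is only partially resident; from the Size and Rss values above:

```python
# Numbers taken from the smaps entry for libtorch.so above.
size_kb = 1105772   # Size of the r-xp mapping
rss_kb = 304636     # pages actually faulted in
fraction = rss_kb / size_kb
print(f'{fraction:.1%} of the libtorch.so text mapping is resident')
# -> 27.5% of the libtorch.so text mapping is resident
```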
It is a private memory mapping, which also surprises me a little: does this mean it cannot be shared across multiple processes, even for the CPU code?
As for the large anonymous mapping that contains the kernel code, my guess is that the CUDA driver allocates that memory to store the GPU code, and that the area is mapped into the GPU’s virtual address space so the GPU cores can access and execute it directly. Does anyone know more details here?