Why does PyTorch use so much CPU RAM?

I did some experiments with the code below:

import torch
cpu, gpu = torch.device('cpu'), torch.device('cuda:0')
size = (1, 1, 1)
t_gpu1 = torch.zeros(size, device=gpu)

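To see when this memory appears, here is a minimal sketch (assuming psutil is installed) that prints the process RSS before and after the first CUDA allocation:

# Sketch: measure the process RSS around the first CUDA allocation.
# Assumes psutil is available (pip install psutil).
import os
import psutil
import torch

proc = psutil.Process(os.getpid())
print(f"RSS before CUDA init: {proc.memory_info().rss / 2**20:.0f} MiB")

t = torch.zeros((1, 1, 1), device='cuda:0')  # first CUDA op creates the CUDA context
torch.cuda.synchronize()

print(f"RSS after CUDA init:  {proc.memory_info().rss / 2**20:.0f} MiB")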
With the command pmap -x <pid>, I see that there is a large anonymous mapping (about 1 GB), which is non-file-backed memory, and most of it is resident (RSS).

000055d6b008d000 1172860 1171972 1171972 rw---   [ anon ]

What's more, from the output of cat /proc/<pid>/smaps, this large mapping seems to be private, which means it cannot be shared between multiple processes. It is also dirty, which means the pages would have to be written out to the swap area before they could be reclaimed (I don't know why it is dirty):

55d6b008d000-55d6f79ec000 rw-p 00000000 00:00 0                          [heap]
Size:            1172860 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:             1171972 kB
Pss:             1171972 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:   1171972 kB
Referenced:      1171336 kB
Anonymous:       1171972 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
VmFlags: rd wr mr mw me ac sd
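
To put a number on this, here is a minimal sketch (assuming Linux) that sums Private_Dirty over the anonymous mappings of a process:

# Sketch: sum Private_Dirty over anonymous mappings in /proc/<pid>/smaps.
# Assumes Linux; pass a PID as the first argument, or omit it to inspect
# the current process.
import sys

def anon_private_dirty_kb(pid="self"):
    total_kb = 0
    in_anon = False
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            fields = line.split()
            # A mapping header looks like "addr-addr perms offset dev inode [path]";
            # treat it as anonymous if it has no file path or is the [heap] region.
            if fields and "-" in fields[0] and len(fields) >= 5 and len(fields[1]) == 4:
                in_anon = len(fields) == 5 or fields[-1] == "[heap]"
            elif in_anon and line.startswith("Private_Dirty:"):
                total_kb += int(fields[1])  # values are reported in kB
    return total_kb

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    print(f"anonymous Private_Dirty: {anon_private_dirty_kb(pid)} kB")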

Finally, I used cuda-gdb to attach to this Python process and dump the memory at this address:

(cuda-gdb) dump binary memory pytorch-mem.bin 0x000055d6b008d000 0x000055d6f79ec000
(cuda-gdb) shell strings pytorch-mem.bin > pytorch-mem.bin.strings
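
For reference, a rough Python equivalent of the strings step (a sketch, assuming the dump was written to pytorch-mem.bin as above):

# Sketch: scan the dumped region for printable ASCII runs, similar to strings,
# and keep only lines that look like CUDA/kernel symbols.
import mmap
import re

with open("pytorch-mem.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for m in re.finditer(rb"[\x20-\x7e]{8,}", mm):  # runs of 8+ printable chars
        s = m.group().decode("ascii")
        if "cuda" in s.lower() or "kernel" in s.lower():
            print(s)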

I see lots of CUDA kernel symbols in this memory region. But why are most of them loaded into memory when such simple code is run? And since this memory is private, meaning it cannot be shared across processes, will each process take this much memory when multiple processes run at the same time?

I also checked the size of libtorch.so: it is about 1.2 GB and should contain both the CPU code and the GPU (kernel) code. Below is the file-backed mapping of libtorch.so:

7f7c896f7000-7f7ccced2000 r-xp 00000000 08:02 38546131                   /usr/share/miniconda3/envs/venv/lib/python3.7/site-packages/torch/lib/libtorch.so
Size:            1105772 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:              304636 kB
Pss:              304636 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:    304636 kB
Private_Dirty:         0 kB
Referenced:       304636 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
VmFlags: rd ex mr mw me sd

It is a private memory mapping, which also surprises me a little. Does this mean it cannot be shared across multiple processes, even for the CPU code?

For the large anonymous mapping which contains the kernel code, my guess is that the CUDA driver allocates that memory to store the GPU code, and that the area is mapped into the GPU's virtual address space so the GPU cores can directly access and execute it. Does anyone know more details here?

According to https://unix.stackexchange.com/questions/478533/why-are-program-images-and-shared-libraries-considered-private-to-a-process, private pages are copy-on-write; as long as they are only read, they can still be shared by the kernel in physical memory. But the anonymous part is still an issue, because it is already dirty.
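
One way to check how much is actually shared between the workers is to compare Rss with Pss per process; here is a minimal sketch, assuming a kernel with /proc/<pid>/smaps_rollup (4.14+) and that the worker PIDs are known:

# Sketch: compare Rss and Pss per process. Pages shared by N processes are
# counted fully in Rss but only as 1/N in Pss, so a large gap indicates sharing.
def rollup_kb(pid, key):
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            if line.startswith(key + ":"):
                return int(line.split()[1])  # kB
    return 0

worker_pids = [1234, 1235]  # hypothetical PIDs of the PyTorch worker processes
for pid in worker_pids:
    rss, pss = rollup_kb(pid, "Rss"), rollup_kb(pid, "Pss")
    print(f"pid {pid}: Rss={rss} kB  Pss={pss} kB  approx shared={rss - pss} kB")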

Hi,

I don't think there is anything we can do about this. The CUDA driver loads all the code this way :confused:
I can think of two reasons for this: one is to have all the code readily available so it can easily be sent to the GPU; the other is to be able to do unified memory.

Thanks for your quick response.

Is it possible for PyTorch to use the CUDA Driver API instead of the CUDA Runtime API, so that PyTorch would have more control over how the kernel code is loaded?

This issue becomes a bottleneck when we try to deploy our product with multiple processes: with 14 processes, it takes 85%+ of 32 GB of memory. That is fine for training, but it is a problem in production scenarios.


Maybe @ptrblck has a better idea of how feasible that is?

I’m not sure that the Driver API would solve your problem and would need some more information about your use case and what you are trying to achieve with these 14 processes.

I.e., is each process loading a different model or script, or are these just replicas on a single GPU?

Our case is that we have 2 GPUs and create 7 processes for each GPU to deal with different inputs. Each process loads the same model into the GPU (I know we can optimize here). Our initial thinking was that GPU RAM would be the bottleneck, but in reality CPU RAM becomes the bottleneck first, which is why I did this investigation and am seeking help here.
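
For reference, our setup looks roughly like this sketch (build_model and frames_for_stream are hypothetical placeholders for the real model and video input); every spawned worker initializes its own CUDA context and therefore pays the per-process driver RAM cost:

# Sketch of the deployment: 2 GPUs x 7 workers, each worker loading its own
# copy of the model. build_model() and frames_for_stream() are placeholders.
import torch
import torch.multiprocessing as mp

def worker(gpu_id, stream_id):
    device = torch.device(f"cuda:{gpu_id}")
    model = build_model().to(device).eval()         # full model copy per process
    with torch.no_grad():
        for frame in frames_for_stream(stream_id):  # hypothetical input source
            out = model(frame.to(device, non_blocking=True))
            # ... postprocess / publish results ...

if __name__ == "__main__":
    mp.set_start_method("spawn")  # required when using CUDA in child processes
    procs = [mp.Process(target=worker, args=(gpu, gpu * 7 + i))
             for gpu in range(2) for i in range(7)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()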


I'm not sure why you need a dedicated process for different inputs.
Could you explain a bit more in which way the inputs differ?

That might be the case, e.g. if your data loading is not fast enough.
However, do you see that multiple processes (each recreating the model) on a single device actually speed things up?
I would assume that multiple workers in a single DataLoader would give better performance.
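
Something along these lines, as a sketch (VideoFrameDataset, build_model, and streams are hypothetical placeholders); the DataLoader workers only decode and prepare data, so as long as the dataset itself never creates CUDA tensors, they do not each pay the large per-process CUDA mapping:

# Sketch: one process, one model, several DataLoader workers for decoding.
# VideoFrameDataset, build_model, and streams are hypothetical placeholders.
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda:0")
model = build_model().to(device).eval()          # single copy of the model

loader = DataLoader(
    VideoFrameDataset(streams),                  # CPU-side decoding only
    batch_size=14,
    num_workers=4,                               # workers stay CUDA-free
    pin_memory=True,
)

with torch.no_grad():
    for batch in loader:
        out = model(batch.to(device, non_blocking=True))
        # ... postprocess / publish results ...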

The inputs are different video streams; sorry, I can't go into more detail here.

Our current architecture has each process dealing with one video stream separately, and we have MPS enabled. Another architecture would be to use only one process and batch all inputs together; one bottleneck there is the GPU memory required for a large batch size. So do you think the latter will have better performance? (Let's assume all models have been initialized.)

Hi, have you been able to find any solution? I'm currently struggling with CPU RAM saturation myself. It's awful that CPU RAM stays full even after moving the models to the GPU.


Is there any update on this issue? I have the same issue, which makes it impossible to train any model…


I confirm that I am also having the same issue; it is becoming very difficult to deploy models to production across multiple inference workers as CPU RAM runs out fast.