RuntimeError: unable to open shared memory object (while using model.share_memory())

Hi,

I would like to put multiple models into CPU memory with shared memory support, so that I can easily transfer models among multiple processes.
I am using code logic like the following:

...
a_list_of_models = load_models()

for model in a_list_of_models:
    model.share_memory()
...

The code works fine with 4-5 models, but if I launch more models, I get RuntimeError: unable to open shared memory object.

I have seen many other people hit the same runtime error while using a DataLoader. However, I am not using a DataLoader here, so the usual workaround of setting num_workers to 0 does not apply.

I have also checked that the shared memory limit on my machine is unlimited, and the physical memory is more than enough to hold 100 models.
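(For reference, here is the rough back-of-the-envelope check I did; just a sketch, and the sizes assume fp32 torchvision models.)

import torch
from torchvision import models

def model_size_mb(model: torch.nn.Module) -> float:
    """Approximate size of a model's parameters and buffers in MiB."""
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    n_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return n_bytes / (1024 ** 2)

# resnet152 is roughly 230 MiB in fp32, so 100 models of this size
# would need on the order of 25 GB of RAM.
print(model_size_mb(models.resnet152()))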

The most relevant topic I found is this one: How to configure shared memory size?
Unfortunately, there is no answer there.

Any suggestions?


Hi,

Could you share a more comprehensive repro of the issue, such as the models that you are loading, if possible?

How did you check the shared memory limit on your machine? Does the output of ipcs -lm indicate that the shared memory is unlimited (it should report the maximum number of SHM segments and the maximum SHM size)? Also, just to confirm, have you checked whether any other processes are taking up too much shared memory on your machine?

To increase the shared memory limit, you can try setting the kernel.shmmax parameter (i.e. sysctl -w kernel.shmmax=...), and follow the steps here: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/tuning_and_optimizing_red_hat_enterprise_linux_for_oracle_9i_and_10g_databases/sect-oracle_9i_and_10g_tuning_guide-setting_shared_memory-setting_shmmni_parameter to increase the number of shared memory segments available.
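If it is easier to script, the same limits can also be read from /proc, and the usage of /dev/shm (which backs POSIX shared memory) can be checked with shutil; a small sketch, Linux only, with the /proc paths assumed to exist:

from pathlib import Path
import shutil

# SysV shared memory limits: shmmax (bytes), shmall (pages), shmmni (max segments)
for name in ("shmmax", "shmall", "shmmni"):
    print(name, Path(f"/proc/sys/kernel/{name}").read_text().strip())

# /dev/shm usage (what df -h shows for that mount): total / used / free, in bytes
print(shutil.disk_usage("/dev/shm"))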

Hi Rohan,

Here is the code snippet to reproduce the bug:

import torch
from torchvision import models
import time

def main():
    """Load several torchvision models and move their parameters into shared memory."""
    model_list = [
        ['resnet152', models.resnet152],
        ['inception_v3', models.inception_v3],
        ['vgg16', models.vgg16],
        ['vgg19', models.vgg19],
        ['vgg19_bn', models.vgg19_bn],
        ['densenet201', models.densenet201],
        ['densenet169', models.densenet169],
        ['resnet152-2', models.resnet152],
        ['resnet152-3', models.resnet152],
        ['resnet152-4', models.resnet152],
        ['resnet152-5', models.resnet152],
        ['resnet152-6', models.resnet152],
        ['resnet152-7', models.resnet152],
        ['resnet152-8', models.resnet152],
        ['resnet152-9', models.resnet152]
    ]
    models_dict = {}

    for m in model_list:
        model = m[1](pretrained=True)
        model.share_memory()
        models_dict[m[0]] = model
        print('loaded ', m[0])

    # Optionally keep the process alive to observe shared memory usage:
    # while True:
    #     time.sleep(1)

if __name__ == "__main__":
    main()
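(Side note: to rule out share_memory() silently doing nothing, a quick check like this can confirm the parameters actually end up in shared memory; just a sketch using torch.Tensor.is_shared():)

from torchvision import models

# After share_memory(), every parameter tensor should report is_shared() == True.
m = models.resnet152(pretrained=True)
m.share_memory()
print(all(p.is_shared() for p in m.parameters()))  # expected: True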

Here is the output of ipcs -lm:


------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398509481980
min seg size (bytes) = 1

I have tried increasing the max number of segments to 8192 (using sysctl -w kernel.shmmni=8192), but it does not help.

Another interesting fact:
before the program crashed, I did not see any files created in the /dev/shm folder, yet the disk usage of /dev/shm kept increasing (based on the df -h command).
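To see that growth programmatically, something like the following (a rough sketch that just samples shutil.disk_usage on /dev/shm before and after sharing one model) should show the same increase that df -h reports:

import shutil
from torchvision import models

def shm_used_mb() -> float:
    # Used space on the /dev/shm mount in MiB (the same number df -h reports).
    return shutil.disk_usage("/dev/shm").used / (1024 ** 2)

print("before:", shm_used_mb())
m = models.resnet152(pretrained=True)
m.share_memory()
print("after sharing one resnet152:", shm_used_mb())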