Unable to open shared memory object

EvanZ · August 12, 2022, 6:29pm

My code previously worked fine, but I am setting up a new EC2 instance and I think I must be missing a step I did before, because I now get the following error:

Exception has occurred: RuntimeError
unable to open shared memory object </torch_90969_3264142153_8165> in read-write mode: Too many open files (24)
  File "/path/to/training_script.py", line 142, in <module>
    trainer.fit(model, datamodule=dm)

I saw in a previous topic with this error (RuntimeError: unable to open shared memory object - #9 by Maxwell_Albert) that setting num_workers to 0 helped, but I tried that and still get the same error. I am using:

absl-py                   1.2.0
aiohttp                   3.8.1
aiosignal                 1.2.0
alembic                   1.8.1
async-timeout             4.0.2
attrs                     22.1.0
cachetools                5.2.0
certifi                   2022.6.15
charset-normalizer        2.1.0
click                     8.1.3
cloudpickle               2.1.0
cramjam                   2.5.0
databricks-cli            0.17.1
docker                    5.0.3
entrypoints               0.4
fastparquet               0.8.1
Flask                     2.2.2
frozenlist                1.3.1
fsspec                    2022.7.1
gitdb                     4.0.9
GitPython                 3.1.27
google-auth               2.10.0
google-auth-oauthlib      0.4.6
greenlet                  1.1.2
grpcio                    1.47.0
gunicorn                  20.1.0
idna                      3.3
importlib-metadata        4.12.0
itsdangerous              2.1.2
Jinja2                    3.1.2
joblib                    1.1.0
Mako                      1.2.1
Markdown                  3.4.1
MarkupSafe                2.1.1
mlflow                    1.28.0
multidict                 6.0.2
numpy                     1.23.1
oauthlib                  3.2.0
packaging                 21.3
pandas                    1.4.3
pip                       22.2.2
prometheus-client         0.14.1
prometheus-flask-exporter 0.20.3
protobuf                  3.19.4
pyasn1                    0.4.8
pyasn1-modules            0.2.8
pyDeprecate               0.3.2
PyJWT                     2.4.0
pyparsing                 3.0.9
python-dateutil           2.8.2
pytorch-lightning         1.7.1
pytz                      2022.2
PyYAML                    6.0
querystring-parser        1.2.4
requests                  2.28.1
requests-oauthlib         1.3.1
rsa                       4.9
scikit-learn              1.1.2
scipy                     1.9.0
setuptools                49.2.1
six                       1.16.0
sklearn                   0.0
smmap                     5.0.0
SQLAlchemy                1.4.40
sqlparse                  0.4.2
tabulate                  0.8.10
tensorboard               2.10.0
tensorboard-data-server   0.6.1
tensorboard-plugin-wit    1.8.1
threadpoolctl             3.1.0
torch                     1.12.1
torchinfo                 1.7.0
torchmetrics              0.9.3
torchtext                 0.13.1
tqdm                      4.64.0
typing_extensions         4.3.0
urllib3                   1.26.11
websocket-client          1.3.3
Werkzeug                  2.2.2
wheel                     0.37.1
yarl                      1.8.1
zipp                      3.8.1

Any suggestions would be much appreciated.

ptrblck · August 12, 2022, 7:56pm

You might need to increase the file descriptor limit via ulimit -n.
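
Note that ulimit -n only changes the limit for the shell it is run in and for processes started from that shell, so a training job launched some other way (an IDE debugger, a service, a notebook) may still see the old value. As a rough sketch, the limit can also be checked and raised from inside the Python process with the standard-library resource module (Unix only). The helper name raise_nofile_soft_limit and the target of 4096 are illustrative assumptions, not values taken from this thread, and raising the limit is not guaranteed to resolve the error.

```python
import resource


def raise_nofile_soft_limit(target=4096):
    """Try to raise this process's soft open-file limit to `target`.

    `target` is an illustrative value (an assumption, not from the thread).
    The soft limit is never lowered and never pushed past the hard limit.
    Returns the (soft, hard) pair in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Nothing to do if the soft limit is already unlimited.
    if soft == resource.RLIM_INFINITY:
        return soft, hard

    # An unprivileged process can raise its soft limit only up to the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)

    if new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            # The OS refused the new value; keep the existing limit.
            pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_nofile_soft_limit())
```

Calling the helper at the top of the training script, before any DataLoader is created, would apply the new limit to the process that actually raises the error. Printing the values also confirms whether an earlier ulimit -n change reached that process at all.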

EvanZ · August 12, 2022, 8:29pm

I tried that as well, to no avail.

HELLORPG · April 21, 2025, 3:45pm

I have the same issue, and increasing the limit with ulimit -n did not solve my problem.