Training fails due to memory exhaustion when running in a python multiprocessing.Process

When deploying our training code, we schedule training jobs in a multiprocessing.Process with the forkserver context in order to supervise the work, isolate errors, etc.

Sadly, this crashes when training on sufficiently large datasets due to exhaustion of shared memory (/dev/shm). If increasing the /dev/shm size were the answer, I wouldn’t be here. The problem is that it consistently fails as soon as the shared memory allocation reaches exactly 250 MB, every time.

We have no idea where this seemingly arbitrary limit comes from. Any help or pointers would be appreciated. I acknowledge this issue might not be directly related to pytorch, but I’m not sure where else to ask.

Some additional environment information:

  • Using python 3.11.9
  • Tested on pytorch 2.1.2, 2.3.0
  • /dev/shm size is >10 GB
  • Reproducible with spawn and forkserver, but not with fork
  • Reproducible within docker (on RHEL 8) and WSL 2 (Ubuntu 20.04)
  • num_workers must be > 0 to trigger it; even num_workers = 1, batch_size = 1 fails
  • The error occurs inside DataLoader, specifically when it creates the iterator
  • Training works fine when run outside a multiprocessing.Process
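For anyone debugging something similar, it may help to first print the two settings that (as described below) turned out to matter: the process-global default start method that DataLoader workers inherit, and torch's tensor-sharing strategy. A small diagnostic sketch:

```python
import multiprocessing as mp
import torch.multiprocessing as tmp

# The start method DataLoader worker processes will inherit when no
# explicit multiprocessing_context is passed to the DataLoader.
print("default start method:", mp.get_start_method())

# The strategy torch uses to share tensors with workers
# ("file_descriptor" or "file_system" on Linux).
print("sharing strategy:", tmp.get_sharing_strategy())
print("available strategies:", tmp.get_all_sharing_strategies())
```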

Partial stacktrace

  File "/home/.../.cache/pypoetry/virtualenvs/.../lib/python3.11/site-packages/torch/utils/data/", line 434, in __iter__
    self._iterator = self._get_iterator()
  File "/home/.../.cache/pypoetry/virtualenvs/.../lib/python3.11/site-packages/torch/utils/data/", line 387, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/.../.cache/pypoetry/virtualenvs/.../lib/python3.11/site-packages/torch/utils/data/", line 1040, in __init__
  File "/usr/lib/python3.11/multiprocessing/", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.11/multiprocessing/", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.11/multiprocessing/", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.11/multiprocessing/", line 32, in __init__
  File "/usr/lib/python3.11/multiprocessing/", line 19, in __init__
  File "/usr/lib/python3.11/multiprocessing/", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.11/multiprocessing/", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/home/.../.cache/pypoetry/virtualenvs/.../lib/python3.11/site-packages/torch/multiprocessing/", line 557, in reduce_storage
    metadata = storage._share_filename_cpu_()
  File "/home/.../.cache/pypoetry/virtualenvs/.../lib/python3.11/site-packages/torch/", line 304, in wrapper
    return fn(self, *args, **kwargs)
  File "/home/.../.cache/pypoetry/virtualenvs/.../lib/python3.11/site-packages/torch/", line 378, in _share_filename_cpu_
    return super()._share_filename_cpu_(*args, **kwargs)
RuntimeError: unable to mmap 320 bytes from file </torch_8549_3875816624_63978>: Cannot allocate memory (12)

One wrong assumption threw us way off track: we assumed the multiprocessing context used by pytorch was independent of the parent process’s context, but it isn’t.

Basically, by spinning off a python process using spawn or forkserver, the pytorch workers also switched to that context, so forkserver became the context used within pytorch. Once we added set_start_method("fork", force=True), the problem went away. It turns out that, regardless of how the parent process creates the subprocess (fork, forkserver, or spawn), if the pytorch context is anything other than fork, worker startup makes extensive use of named files or file descriptors (depending on your selection of sharing strategy), and that hits system limits in different ways.
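The inheritance is easy to demonstrate with the standard library alone. Here `library_default_method` is a hypothetical stand-in for any library code (like DataLoader's worker startup) that uses multiprocessing without pinning an explicit context:

```python
import multiprocessing as mp

def library_default_method():
    # What a library sees when it creates workers without an explicit
    # context: the process-global default start method.
    return mp.get_start_method()

mp.set_start_method("forkserver", force=True)
inherited = library_default_method()   # "forkserver": silently inherited
mp.set_start_method("fork", force=True)
fixed = library_default_method()       # "fork": the workaround
print(inherited, fixed)
```

Nothing in the "library" function changed between the two calls; only the global default did, which is exactly how our supervision wrapper leaked its context into pytorch's workers.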

If the sharing strategy is file_descriptor, you hit “too many open files”, since the number of descriptors created (in our scenario) exceeds 15k. If the strategy is file_system, the error seems to come from the limit on memory mappings, “max_map_count”, found under /proc/sys/vm/max_map_count; that limit, multiplied by the 4 KB page size, seems to give the magical 250 MB number.
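A quick sanity check on that arithmetic, assuming the common Linux default of 65530 for vm.max_map_count, 4 KiB pages, and that each tiny shared segment (like the 320-byte one in the stack trace) costs at least one page-sized mapping:

```python
# Back-of-the-envelope check of the "magical" ~250 MB ceiling.
DEFAULT_MAX_MAP_COUNT = 65530   # common Linux default for vm.max_map_count
PAGE_SIZE = 4096                # 4 KiB pages on x86-64

# If every shared segment consumes at least one page-sized mapping,
# this bounds the total shared memory reachable before mmap fails.
budget_mib = DEFAULT_MAX_MAP_COUNT * PAGE_SIZE / 2**20
print(round(budget_mib))        # roughly 256 MiB, i.e. the ~250 MB wall
```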

We simply set the torch mp_context to fork and called it a day.
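For anyone who would rather not force-override the global start method, DataLoader also accepts a multiprocessing_context argument, which scopes the same fix to a single loader. A minimal sketch with a toy dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8, dtype=torch.float32))

# Pin the worker start method for this loader only, rather than
# calling set_start_method("fork", force=True) process-wide.
loader = DataLoader(dataset, batch_size=4, num_workers=1,
                    multiprocessing_context="fork")

total = sum(batch[0].numel() for batch in loader)
print(total)  # 8: all samples seen, workers started via fork
```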