Training fails due to memory exhaustion when running in a Python multiprocessing.Process

One wrong assumption threw us way off track… The multiprocessing context used by PyTorch depends on the start method of the parent process; we assumed it didn’t.

Basically, when we spun off a Python process using spawn or forkserver, the PyTorch workers switched to that context as well, so forkserver ended up being the context used inside PyTorch. Once we added set_start_method("fork", force=True), the problem went away. It turns out that regardless of how the parent process creates the subprocess, whether through fork, forkserver, or spawn, any PyTorch context other than fork leads to extensive use of named files or file descriptors (depending on your choice of sharing strategy), and that hits system limits in different ways.
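A minimal sketch of the shape of the problem and the fix, assuming a toy TensorDataset and a hypothetical train() entry point (not our actual training code): the parent opts into spawn, the child inherits that as its default start method, and the set_start_method("fork", force=True) override restores fork before the DataLoader workers are created.

```python
import multiprocessing as mp

import torch
from torch.utils.data import DataLoader, TensorDataset


def train():
    # Without this override the child keeps the inherited "spawn" default,
    # and the DataLoader workers below are spawned as well.
    mp.set_start_method("fork", force=True)

    dataset = TensorDataset(torch.randn(1024, 8))  # toy stand-in dataset
    loader = DataLoader(dataset, batch_size=32, num_workers=4)
    for (batch,) in loader:
        pass  # the real training step would go here


if __name__ == "__main__":
    # The parent opts into "spawn"; the spawned child inherits it as its
    # default start method, so its own worker processes are spawned too
    # unless it forces the method back to "fork".
    mp.set_start_method("spawn")
    p = mp.Process(target=train)
    p.start()
    p.join()
```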

With the file_descriptor strategy, you hit “too many open files” or “too many fds”, since the number of descriptors created (in our scenario) exceeds 15k. With the file_system strategy, the error seems to come from the limit on memory mappings, max_map_count, found under /proc/sys/vm/max_map_count; that limit, multiplied by the 4 KB page size, seems to give the magical 250 MB number.
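To check which sharing strategy is in play and how close you are to the relevant limits, something along these lines helps (the 4 KB figure assumes the standard Linux page size):

```python
import resource

import torch.multiprocessing as torch_mp

# Which shared-memory strategy torch.multiprocessing is using:
# "file_descriptor" (the Linux default) or "file_system".
print("sharing strategy:", torch_mp.get_sharing_strategy())

# Limit that matters for "file_descriptor": the per-process open-file limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE soft/hard:", soft, hard)

# Limit that matters for "file_system": the number of memory mappings
# a single process may hold.
with open("/proc/sys/vm/max_map_count") as f:
    max_map_count = int(f.read())
print("vm.max_map_count:", max_map_count)
print("max_map_count * 4 KiB pages ≈ %.0f MiB" % (max_map_count * 4096 / 2**20))
```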

We simply set the torch mp_context to fork and called it a day.
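For reference, the context can also be pinned per loader instead of globally; this is a sketch using DataLoader’s multiprocessing_context argument, not necessarily exactly what we shipped:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 8))  # toy stand-in dataset
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    multiprocessing_context="fork",  # fork the workers regardless of the global default
)
```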