Segmentation fault when using Torchrun

app31t · May 14, 2024, 11:32am

My program uses two GPUs with DDP while looping over a selection of hyperparameters. To run it in parallel, I made some changes to make it compatible with torchrun, as explained in this tutorial: https://www.youtube.com/watch?v=9kIvQOiwYzg&list=PL_lsbAsL_o2CSuhUhJIiW0IkdT5C2wGWj&index=4

When starting torchrun, I now get this error:

torchrun --standalone --nnodes=1 --nproc-per-node=1 network/main.py
Fatal Python error: Segmentation fault

Current thread 0x00007fec5de4c740 (most recent call first):
  File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
  File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in __init__
  File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 253 in create_backend
  File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
  File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 258 in create_handler
  File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
  File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 238 in launch_agent
  File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135 in __call__
  File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/run.py", line 803 in run
  File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/run.py", line 812 in main
  File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
  File "/home/anaconda3/envs/torch/bin/torchrun", line 8 in <module>

Extension modules: mkl._mklinit, mkl._py_mkl_service, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 22)
Segmentation fault (core dumped)

I already tried running it in a new conda env with freshly installed torch, numpy and mkl, but the error stays the same.

raghavm1 · June 10, 2024, 4:38am

Might be a bit too late here, but if your Python version is 3.12 and you haven’t passed --rdzv-backend (your traceback shows the c10d rendezvous backend in use), this is a known issue that was fixed very recently.

Source: pytorch/pytorch issue #125990, "torchrun c10d backend doesn't seem to work with python 3.12, giving segmentation fault because of calling obmalloc without holding GIL"
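
To check whether an environment matches the conditions described above, a small script along these lines may help. It is only a sketch: the helper names `runs_on_python_312` and `installed_torch_version` are made up for illustration, and it does not try to decide which PyTorch release contains the fix, since that is not stated in this thread.

```python
import sys
from importlib import metadata


def runs_on_python_312(version_info=sys.version_info):
    """Return True when the given interpreter version is Python 3.12.x."""
    return tuple(version_info[:2]) == (3, 12)


def installed_torch_version():
    """Return the installed torch version string, or None if torch is not installed."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python:", ".".join(map(str, sys.version_info[:3])))
    print("torch:", installed_torch_version())
    if runs_on_python_312():
        # Assumption: the segfault is the one tracked in pytorch/pytorch issue #125990.
        print("Python 3.12 detected; compare your setup with issue #125990.")
```

Run it inside the same conda env that torchrun is launched from. If it reports Python 3.12, upgrading to a PyTorch build that includes the fix mentioned above is the first thing to try.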