My program uses two GPUs with DDP while looping over a set of hyperparameter configurations. To run the computation in parallel, I changed it to be compatible with torchrun, as explained in this tutorial: https://www.youtube.com/watch?v=9kIvQOiwYzg&list=PL_lsbAsL_o2CSuhUhJIiW0IkdT5C2wGWj&index=4
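For context, here is a minimal sketch of the general pattern such a script follows. It is not the actual network/main.py. It only shows the torch-independent part: reading the RANK, LOCAL_RANK and WORLD_SIZE environment variables that torchrun sets for each worker, and looping over a hyperparameter grid. The function names torchrun_env and hyperparameter_grid, and the lr and batch_size values, are hypothetical and chosen for illustration.

```python
import itertools
import os


def torchrun_env():
    """Read the per-worker variables that torchrun exports.

    The defaults (rank 0, world size 1) are an assumption so the script
    also runs when started with plain `python` instead of torchrun.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }


def hyperparameter_grid(**choices):
    """Yield one dict per combination of the given hyperparameter lists."""
    keys = list(choices)
    for values in itertools.product(*(choices[key] for key in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    env = torchrun_env()
    # Hypothetical grid; the real script would build the model, wrap it
    # in DDP on GPU env["local_rank"] and train once per combination.
    for params in hyperparameter_grid(lr=[1e-3, 1e-4], batch_size=[32, 64]):
        if env["rank"] == 0:  # print from one worker only
            print(params)
```

Every worker iterates the same grid in the same order, so all ranks stay in step for each configuration.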
When starting torchrun, I now get this error:
torchrun --standalone --nnodes=1 --nproc-per-node=1 network/main.py
Fatal Python error: Segmentation fault
Current thread 0x00007fec5de4c740 (most recent call first):
File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in __init__
File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 253 in create_backend
File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 258 in create_handler
File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 238 in launch_agent
File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135 in __call__
File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/run.py", line 803 in run
File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/run.py", line 812 in main
File "/home/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
File "/home/anaconda3/envs/torch/bin/torchrun", line 8 in <module>
Extension modules: mkl._mklinit, mkl._py_mkl_service, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 22)
Segmentation fault (core dumped)
I already tried running it in a new conda env with freshly installed torch, numpy and mkl, but the error stays the same.
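The trace ends in _call_store inside c10d_rendezvous_backend.py, which is reached from launch_agent in the torchrun launcher itself, so the crash happens while the rendezvous store is being set up rather than inside the training code. One check that does not involve torch at all is whether the machine can resolve localhost and open a listening TCP socket. The sketch below does only that, using the standard library. The helper name check_local_tcp is hypothetical, and passing this check does not rule out a problem inside torch's own store implementation.

```python
import socket


def check_local_tcp(host="localhost"):
    """Resolve `host` and bind a listening TCP socket on a free port.

    Returns the (address, port) the socket was bound to. Raises OSError
    if the name cannot be resolved or the socket cannot be bound.
    """
    # Take the first address the resolver returns for a TCP socket.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, 0, type=socket.SOCK_STREAM
    )[0]
    with socket.socket(family, socktype, proto) as server:
        server.bind(sockaddr)  # port 0 lets the OS pick a free port
        server.listen(1)
        bound = server.getsockname()
    return bound[0], bound[1]


if __name__ == "__main__":
    address, port = check_local_tcp()
    print(f"resolved and bound {address}:{port}")
```

If this raises an OSError, the problem is likely in the host's network configuration; if it succeeds, the segfault is more likely specific to the torch installation.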