Hey,
My problem may be closely related to this post; however, no solution has been posted there yet, so here goes:
Problem: When I run the following training routine, it sometimes finishes without errors and sometimes terminates with a SIGSEGV.
Environment:
- Python 3.9.1
- torch 1.7.1
- CUDA 11.0
- cluster with nodes containing 8 GPUs each (shared by multiple users); jobs are submitted via the LSF batch system
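For reference, the torch/CUDA versions above can be double-checked from inside Python with the standard torch attributes (nothing specific to my setup):

import torch
# print the installed torch version, the CUDA version it was built against,
# and whether the GPUs on the node are visible to this process
print(torch.__version__)                 # 1.7.1
print(torch.version.cuda)                # 11.0
print(torch.cuda.is_available(), torch.cuda.device_count())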
Code:
import os
import torch
import socket
import random
import argparse
from datetime import date
from contextlib import closing
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import DataLoader, Dataset
import numpy as np
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp

class my_dataset(Dataset):
    def __init__(self, n):
        self.data = torch.rand((n, 10), dtype=torch.float32)
        self.labels = torch.randint(0, 2, (n, 1))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return {'data': self.data[idx, :], 'labels': self.labels[idx]}

def find_free_port():
    """ https://stackoverflow.com/questions/1365265/on-localhost-how-do-i-pick-a-free-port-number """
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(('', 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return str(s.getsockname()[1])

def arg_parse():
    desc = "Program to train a segmentation model."
    parser = argparse.ArgumentParser(description=desc)
    parser.add_argument('--devices',
                        type=str,
                        nargs='+',
                        default=['0'],
                        help='Devices to use for model training. Can be GPU IDs as in default or "cpu".')
    return parser.parse_args()

def train_model_distributed(rank_gpu, world_size, dataset, **kwargs):
    # initialize process group
    dist.init_process_group(backend='nccl', init_method='env://', world_size=world_size, rank=rank_gpu)
    torch.cuda.set_device(rank_gpu)

    lr = 1e-3
    batch_size = 8
    batch_size_distr = int(batch_size / world_size)

    # set up dataloader
    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=world_size, rank=rank_gpu)
    train_loader = DataLoader(dataset, batch_size=batch_size_distr, shuffle=False, num_workers=2, prefetch_factor=2,
                              pin_memory=True, sampler=train_sampler)

    # set up model
    model = nn.Sequential(
        nn.Linear(10, 1, bias=True),
        nn.BatchNorm1d(1),
        nn.LeakyReLU(negative_slope=0.3, inplace=True)
    )
    model.cuda(rank_gpu)
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[rank_gpu])

    # set up training
    criterion = nn.MSELoss().cuda(rank_gpu)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.5, 0.999))
    start_epoch = 0
    train_sampler.set_epoch(start_epoch)

    model.train()
    for batch in train_loader:
        data = batch['data']
        labels = batch['labels']
        data = data.to(device=torch.device(rank_gpu), dtype=torch.float32, non_blocking=True)
        labels = labels.to(device=torch.device(rank_gpu), dtype=torch.float32, non_blocking=True)

        # forward pass
        preds = model(data)
        loss = criterion(preds, labels)
        batch_loss = loss.item()

        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'end of training loop rank: {rank_gpu}')
    dist.destroy_process_group()

def main():
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = find_free_port()

    args = arg_parse()
    # devices are submitted as $CUDA_VISIBLE_DEVICES
    devices = args.devices[0].split(',')

    # random seeding
    seed = 7612873
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    dataset = my_dataset(80)
    world_size = len(devices)

    mp.spawn(train_model_distributed, nprocs=world_size, args=(world_size, dataset))
    print('finished ddp training!')

if __name__ == '__main__':
    main()
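For completeness: as the comment in main() indicates, the GPU IDs are passed in from $CUDA_VISIBLE_DEVICES, so the script is launched roughly like this (the LSF submission wrapper around it is omitted):

python run_debugging.py --devices $CUDA_VISIBLE_DEVICES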
Edit: An excerpt from the log and the error message:
end of training loop rank: 0
end of training loop rank: 1
Traceback (most recent call last):
  File "/cluster/home/USER/projects/project1/run_debugging.py", line 174, in <module>
    main()
  File "/cluster/home/USER/projects/project1/run_debugging.py", line 169, in main
    mp.spawn(train_model_distributed, nprocs=world_size, args=(world_size, dataset))
  File "/cluster/home/USER/.pyenv/versions/torch17/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/cluster/home/USER/.pyenv/versions/torch17/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/cluster/home/USER/.pyenv/versions/torch17/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 105, in join
    raise Exception(
Exception: process 1 terminated with signal SIGSEGV
I train on two GPUs on a single node. As mentioned above, the code sometimes executes correctly and sometimes terminates with a SIGSEGV. This holds true even when the code is run repeatedly on the same node (both in parallel and sequentially).
I went through the torch.distributed tutorials, many forum posts, GitHub issues, and example implementations of DistributedDataParallel(), but nothing I found helped.
I am looking forward to any potential solution, hint, or advice.
Cheers