DDP training randomly throws Python segmentation fault

Hi there,
I use torchrun to start a distributed training job on 8x A100 GPUs.
The training succeeds in most cases.
However, very weirdly, it sometimes stops at a random iteration with a ‘python segmentation fault’ error, as shown below.

The PyTorch build I am using is py3.9_cuda11.3_cudnn8.3.2_0.

The error message doesn't seem to provide any useful information for debugging. How can I locate the code that is causing this problem?

2022-11-26T13:25:11.773614372Z Fatal Python error: Segmentation fault
2022-11-26T13:25:11.773660430Z
2022-11-26T13:25:11.773666157Z Thread 0x00007f09fa8f5700 (most recent call first):
2022-11-26T13:25:11.773671399Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
2022-11-26T13:25:11.773677290Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 98 in set_state
2022-11-26T13:25:11.773682641Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 396 in sync
2022-11-26T13:25:11.773687807Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 606 in run
2022-11-26T13:25:11.773693158Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1143 in _keep_alive
2022-11-26T13:25:11.773698338Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1133 in _keep_alive_weak
2022-11-26T13:25:11.773807756Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 255 in _run
2022-11-26T13:25:11.773852673Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/threading.py", line 917 in run
2022-11-26T13:25:11.773863571Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/threading.py", line 980 in _bootstrap_inner
2022-11-26T13:25:11.773872653Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/threading.py", line 937 in _bootstrap
2022-11-26T13:25:11.773881758Z
2022-11-26T13:25:11.773890089Z Current thread 0x00007f0b0ae2b280 (most recent call first):
2022-11-26T13:25:11.773899022Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 203 in _monitor_workers
2022-11-26T13:25:11.773922348Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper
2022-11-26T13:25:11.773932087Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 851 in _invoke_run
2022-11-26T13:25:11.773941996Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 709 in run
2022-11-26T13:25:11.773954682Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper
2022-11-26T13:25:11.773988083Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236 in launch_agent
2022-11-26T13:25:11.774020743Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131 in __call__
2022-11-26T13:25:11.774161902Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/run.py", line 752 in run
2022-11-26T13:25:11.774178066Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/run.py", line 761 in main
2022-11-26T13:25:11.774212095Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
2022-11-26T13:25:11.774280675Z   File "/root/miniconda3/envs/py39_pytorch1121_cu113/bin/torchrun", line 8 in <module>
2022-11-26T13:25:11.809051055Z bash: line 1:   194 Segmentation fault      CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 --nnodes=1 --nproc_per_node=8 train_dist.py --config ./config/xxx.json

Could you update to the latest PyTorch release (if you are using an older one) and check if you are still seeing these issues?
If so, could you try to get the stacktrace from gdb and post it here, please?
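
In the meantime, one low-effort thing you could also try (just a sketch on my side, not specific to your setup): enable Python's faulthandler in every process via the PYTHONFAULTHANDLER environment variable, so that whichever process segfaults dumps a Python-level traceback like the one you posted:

# enables faulthandler.enable() at interpreter startup; the variable is inherited by the
# worker processes torchrun spawns, so a crashing rank will also print a Python traceback
export PYTHONFAULTHANDLER=1
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 --nnodes=1 --nproc_per_node=8 train_dist.py --config ./config/xxx.json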

Thanks for the reply. I tried the newest PyTorch 1.13 and the problem still exists.

Regarding the stack trace, how do I use gdb with torchrun? I am not very familiar with gdb, so a link to a tutorial or something similar would be great. Thanks!

Something like this should work:

gdb --args torchrun other_torchrun_arguments
...
run
...
bt
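
If the crash is too intermittent to sit in an interactive gdb session, an alternative (just a sketch, the paths are placeholders you need to adjust to your machine) is to enable core dumps before launching and open the core file afterwards:

# allow core dumps in the current shell, then launch training with torchrun as usual
ulimit -c unlimited
# after a crash, load the core file into gdb together with the python binary from your conda env;
# the core file name/location depends on /proc/sys/kernel/core_pattern
gdb python /path/to/core
# then at the (gdb) prompt:
bt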

Running the suggested gdb command gives:

“/opt/conda/envs/py_3.9/bin/torchrun”: not in executable format: file format not recognized

torchrun itself is a Python script rather than a native binary, so gdb needs to launch the Python interpreter instead. It should be:

gdb --args python /opt/conda/envs/py_3.9/bin/torchrun other_torchrun_arguments
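
Putting it together with the command from your original post (same arguments and paths as in the logs above, adjust them to your environment), a full session would look roughly like this:

# assumes the py_3.9 conda env is active so that `python` is the env's interpreter
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 gdb --args python /opt/conda/envs/py_3.9/bin/torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 --nnodes=1 --nproc_per_node=8 train_dist.py --config ./config/xxx.json
# at the (gdb) prompt: run the program until it crashes
run
# once gdb reports the segmentation fault, print the native backtrace and post it here
bt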