Hi there,
I use torchrun to launch distributed training on 8x A100 GPUs.
The training succeeds in most cases.
However, it occasionally stops at a random iteration with a Python segmentation fault, as shown below.
The PyTorch build is py3.9_cuda11.3_cudnn8.3.2_0.
The error message doesn't seem to provide enough information for debugging. How can I locate the code that is causing this problem?
2022-11-26T13:25:11.773614372Z Fatal Python error: Segmentation fault
2022-11-26T13:25:11.773660430Z
2022-11-26T13:25:11.773666157Z Thread 0x00007f09fa8f5700 (most recent call first):
2022-11-26T13:25:11.773671399Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
2022-11-26T13:25:11.773677290Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 98 in set_state
2022-11-26T13:25:11.773682641Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 396 in sync
2022-11-26T13:25:11.773687807Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 606 in run
2022-11-26T13:25:11.773693158Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1143 in _keep_alive
2022-11-26T13:25:11.773698338Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1133 in _keep_alive_weak
2022-11-26T13:25:11.773807756Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 255 in _run
2022-11-26T13:25:11.773852673Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/threading.py", line 917 in run
2022-11-26T13:25:11.773863571Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/threading.py", line 980 in _bootstrap_inner
2022-11-26T13:25:11.773872653Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/threading.py", line 937 in _bootstrap
2022-11-26T13:25:11.773881758Z
2022-11-26T13:25:11.773890089Z Current thread 0x00007f0b0ae2b280 (most recent call first):
2022-11-26T13:25:11.773899022Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 203 in _monitor_workers
2022-11-26T13:25:11.773922348Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper
2022-11-26T13:25:11.773932087Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 851 in _invoke_run
2022-11-26T13:25:11.773941996Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 709 in run
2022-11-26T13:25:11.773954682Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper
2022-11-26T13:25:11.773988083Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236 in launch_agent
2022-11-26T13:25:11.774020743Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131 in __call__
2022-11-26T13:25:11.774161902Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/run.py", line 752 in run
2022-11-26T13:25:11.774178066Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/run.py", line 761 in main
2022-11-26T13:25:11.774212095Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
2022-11-26T13:25:11.774280675Z File "/root/miniconda3/envs/py39_pytorch1121_cu113/bin/torchrun", line 8 in <module>
2022-11-26T13:25:11.809051055Z bash: line 1: 194 Segmentation fault CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 --nnodes=1 --nproc_per_node=8 train_dist.py --config ./config/xxx.json
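In case it helps with diagnosing, I am now enabling faulthandler explicitly at the top of train_dist.py, so I can also dump all thread stacks on demand while the job appears stuck (a minimal sketch; the SIGUSR1 registration is just my choice for an on-demand trigger, not something torchrun requires):

```python
import faulthandler
import signal

# Dump Python stacks for every thread if a native crash (e.g. SIGSEGV) occurs.
faulthandler.enable(all_threads=True)

# Also dump all thread stacks on demand: `kill -USR1 <pid>` from another shell
# while the training job is running or hung.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

Note that the trace above only shows the Python frames at crash time; since the fault is in native code, the next step I plan is enabling core dumps (ulimit -c unlimited) and inspecting the native backtrace with gdb.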