NCCL Flight Recorder: trace files not produced

Hi all, I am currently hitting NCCL timeouts in loss.backward() and am trying out Flight Recorder.

  • Question 1: regarding the tutorial ((prototype) Flight Recorder for Debugging Stuck Jobs — PyTorch Tutorials 2.5.0+cu124 documentation), where/how is torch._C._distributed_c10d._dump_nccl_trace() supposed to be executed? On the same process that is experiencing the hang? (My current understanding is sketched right after this list.)
  • Question 2: given the doubt above, I moved on to the on-filesystem dump, but the setup below produces no trace files at all. What am I missing? I tested FR with a toy application and the files did appear, although only after a few minutes (and they were < 1 kB). What I expected to find on disk is sketched further below, after the script.
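For question 1, this is how I read the tutorial: call the dump from a still-alive process (e.g. a monitoring thread or a signal handler) and unpickle the returned bytes. The helper name and the dict keys ("version", "entries") are my own assumptions taken from the tutorial, not something I have verified against this hang:

import pickle
import torch

def dump_fr_trace(path):
    # _dump_nccl_trace() returns the flight recorder buffer as pickled bytes
    raw = torch._C._distributed_c10d._dump_nccl_trace()
    trace = pickle.loads(raw)  # expecting a dict with "version" and "entries"
    print(trace.get("version"), len(trace.get("entries", [])))
    with open(path, "wb") as f:
        f.write(raw)  # keep the raw dump for offline analysis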

Application: https://github.com/NVIDIA/modulus/blob/main/examples/weather/graphcast/train_graphcast.py

Environment: 8 processes (2 nodes running 4 processes each) based on nvcr.io/nvidia/modulus/modulus:24.09

Script:

export TORCH_NCCL_DUMP_ON_TIMEOUT=1
export TORCH_NCCL_DEBUG_INFO_TEMP_FILE=$(pwd)/frtraces
python train_graphcast.py synthetic_dataset=true
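For reference, this is what I expected the filesystem dump to produce: one pickled file per rank named <TORCH_NCCL_DEBUG_INFO_TEMP_FILE><rank>, so frtraces0 … frtraces7 here. The loader below reflects that assumption (and the same "entries" key as above); it is how I planned to inspect the dumps, not a verified format:

import glob
import pickle

for path in sorted(glob.glob("frtraces*")):
    with open(path, "rb") as f:
        trace = pickle.load(f)
    # each dump should hold that rank's flight recorder entries
    print(path, len(trace.get("entries", [])))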

Log excerpt:

...
nid002288:46634:46634 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
...
nid002288:46635:48305 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.6.0
...
[rank3]:[E1019 15:08:24.992240043 ProcessGroupNCCL.cpp:1416] [PG 0 (default_pg) Rank 3] First PG on this rank that detected no heartbeat of its watchdog.
[rank3]:[E1019 15:08:24.992289829 ProcessGroupNCCL.cpp:1454] [PG 0 (default_pg) Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank3]:[E1019 15:08:24.993473826 ProcessGroupNCCL.cpp:1245] [PG 0 (default_pg) Rank 3] ProcessGroupNCCL preparing to dump debug info.
[rank2]:[E1019 15:08:24.010946730 ProcessGroupNCCL.cpp:1416] [PG 0 (default_pg) Rank 2] First PG on this rank that detected no heartbeat of its watchdog.
[rank2]:[E1019 15:08:24.010976929 ProcessGroupNCCL.cpp:1454] [PG 0 (default_pg) Rank 2] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank1]:[E1019 15:08:24.011597257 ProcessGroupNCCL.cpp:1416] [PG 0 (default_pg) Rank 1] First PG on this rank that detected no heartbeat of its watchdog.
[rank1]:[E1019 15:08:24.011631443 ProcessGroupNCCL.cpp:1454] [PG 0 (default_pg) Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank2]:[E1019 15:08:24.011645099 ProcessGroupNCCL.cpp:1245] [PG 0 (default_pg) Rank 2] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[E1019 15:08:24.011821029 ProcessGroupNCCL.cpp:1245] [PG 0 (default_pg) Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank6]:[E1019 15:08:24.169857543 ProcessGroupNCCL.cpp:1416] [PG 0 (default_pg) Rank 6] First PG on this rank that detected no heartbeat of its watchdog.
[rank6]:[E1019 15:08:24.169899635 ProcessGroupNCCL.cpp:1454] [PG 0 (default_pg) Rank 6] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank6]:[E1019 15:08:24.170951501 ProcessGroupNCCL.cpp:1245] [PG 0 (default_pg) Rank 6] ProcessGroupNCCL preparing to dump debug info.
[rank4]:[E1019 15:08:26.173684563 ProcessGroupNCCL.cpp:1416] [PG 0 (default_pg) Rank 4] First PG on this rank that detected no heartbeat of its watchdog.
[rank4]:[E1019 15:08:26.173726825 ProcessGroupNCCL.cpp:1454] [PG 0 (default_pg) Rank 4] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank4]:[E1019 15:08:26.173889069 ProcessGroupNCCL.cpp:1245] [PG 0 (default_pg) Rank 4] ProcessGroupNCCL preparing to dump debug info.
[rank5]:[E1019 15:08:28.120236294 ProcessGroupNCCL.cpp:1416] [PG 0 (default_pg) Rank 5] First PG on this rank that detected no heartbeat of its watchdog.
[rank5]:[E1019 15:08:28.120280269 ProcessGroupNCCL.cpp:1454] [PG 0 (default_pg) Rank 5] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank5]:[E1019 15:08:28.120506898 ProcessGroupNCCL.cpp:1245] [PG 0 (default_pg) Rank 5] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[E1019 15:08:33.013034036 ProcessGroupNCCL.cpp:1416] [PG 0 (default_pg) Rank 0] First PG on this rank that detected no heartbeat of its watchdog.
[rank0]:[E1019 15:08:33.013073652 ProcessGroupNCCL.cpp:1454] [PG 0 (default_pg) Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank0]:[E1019 15:08:33.013304909 ProcessGroupNCCL.cpp:1245] [PG 0 (default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank7]:[E1019 15:08:33.079467506 ProcessGroupNCCL.cpp:1416] [PG 0 (default_pg) Rank 7] First PG on this rank that detected no heartbeat of its watchdog.
[rank7]:[E1019 15:08:33.079509437 ProcessGroupNCCL.cpp:1454] [PG 0 (default_pg) Rank 7] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank7]:[E1019 15:08:33.080099980 ProcessGroupNCCL.cpp:1245] [PG 0 (default_pg) Rank 7] ProcessGroupNCCL preparing to dump debug info.
[rank3]:[F1019 15:18:24.993845844 ProcessGroupNCCL.cpp:1263] [PG 0 (default_pg) Rank 3] [PG 0 (default_pg) Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank2]:[F1019 15:18:24.011878848 ProcessGroupNCCL.cpp:1263] [PG 0 (default_pg) Rank 2] [PG 0 (default_pg) Rank 2] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank1]:[F1019 15:18:24.012508183 ProcessGroupNCCL.cpp:1263] [PG 0 (default_pg) Rank 1] [PG 0 (default_pg) Rank 1] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank6]:[F1019 15:18:24.171246362 ProcessGroupNCCL.cpp:1263] [PG 0 (default_pg) Rank 6] [PG 0 (default_pg) Rank 6] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank4]:[F1019 15:18:26.174143455 ProcessGroupNCCL.cpp:1263] [PG 0 (default_pg) Rank 4] [PG 0 (default_pg) Rank 4] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank5]:[F1019 15:18:28.121026428 ProcessGroupNCCL.cpp:1263] [PG 0 (default_pg) Rank 5] [PG 0 (default_pg) Rank 5] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank0]:[F1019 15:18:33.013580930 ProcessGroupNCCL.cpp:1263] [PG 0 (default_pg) Rank 0] [PG 0 (default_pg) Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank7]:[F1019 15:18:33.080397812 ProcessGroupNCCL.cpp:1263] [PG 0 (default_pg) Rank 7] [PG 0 (default_pg) Rank 7] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
srun: error: nid002285: task 1: Aborted (core dumped)
srun: error: nid002285: tasks 2-3: Aborted (core dumped)
srun: error: nid002288: task 6: Aborted (core dumped)
srun: error: nid002288: task 4: Aborted (core dumped)
srun: error: nid002288: task 5: Aborted (core dumped)
Exception ignored in atexit callback: <function dump_compile_times at 0x15084ff9f1c0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 337, in dump_compile_times
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 324, in compile_times
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 129, in tabulate
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 879, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1016, in get_code
  File "<frozen importlib._bootstrap_external>", line 1073, in get_data
OSError: [Errno 107] Transport endpoint is not connected: '/usr/local/lib/python3.10/dist-packages/tabulate/__init__.py'
Exception ignored in atexit callback: <function dump_compile_times at 0x145d7b1871c0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 337, in dump_compile_times
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 324, in compile_times
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 129, in tabulate
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 879, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1016, in get_code
  File "<frozen importlib._bootstrap_external>", line 1073, in get_data
OSError: [Errno 107] Transport endpoint is not connected: '/usr/local/lib/python3.10/dist-packages/tabulate/__init__.py'
srun: error: nid002285: task 0: Aborted (core dumped)
srun: error: nid002288: task 7: Aborted (core dumped)

Thanks