NCCL operations have failed or timed out

Baraa · August 30, 2024, 10:14pm

Any help explaining this error would be greatly appreciated!
I ran the following command:

python -m torch.distributed.launch --rdzv_endpoint=localhost:29400 --nproc_per_node=2 main.py


Exception in thread Thread-1:
Traceback (most recent call last):
  File "/pre-compiled/python/3.7/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/LIMPQ/env/lib/python3.7/site-packages/tensorboard/summary/writer/event_file_writer.py", line 233, in run
    self._record_writer.write(data)
  File "/LIMPQ/env/lib/python3.7/site-packages/tensorboard/summary/writer/record_writer.py", line 40, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "/wsu/home/gn/gn75/gn7599/LIMPQ/env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 766, in write
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "/LIMPQ/env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 160, in append
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "/LIMPQ/env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 166, in _write
    f.write(compatify(file_content))
OSError: [Errno 5] Input/output error

INFO - Training [8][ 2640/20019]   Loss 3.074509   Top1 50.600142   Top5 74.356652   BatchTime 0.343744   LR 0.038725
INFO - --------------------------------------------------------------------------------------------------------------
INFO - Training [8][ 2660/20019]   Loss 3.073985   Top1 50.612077   Top5 74.376762   BatchTime 0.343832   LR 0.038725
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1466105, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803435 milliseconds before timi$
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1466105, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803435 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 610 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 611) of binary: /LIMPQ/env/bin/python
Traceback (most recent call last):
  File "/pre-compiled/python/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/pre-compiled/python/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/LIMPQ/env/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/LIMPQ/env/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/LIMPQ/env/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/LIMPQ/env/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/LIMPQ/env/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/LIMPQ/env/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
====================================================
main.py FAILED
----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-30_17:45:05
  host      : vohost
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 611)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 611
====================================================
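
To make the watchdog message easier to read, here is a minimal sketch that pulls the fields out of that line and converts the millisecond values to minutes. The regular expression is only an assumption based on the message format pasted above, and `parse_watchdog_line` is a hypothetical helper name, not part of PyTorch.

```python
import re

# Assumed pattern, written against the watchdog line pasted in this post.
# Other PyTorch versions may format the message differently.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Hypothetical helper: return the timeout fields as a dict, or None if no match."""
    m = WATCHDOG_RE.search(line)
    if m is None:
        return None
    timeout_ms = int(m["timeout_ms"])
    elapsed_ms = int(m["elapsed_ms"])
    return {
        "rank": int(m["rank"]),
        "seq_num": int(m["seq"]),
        "op_type": m["op"],
        "timeout_min": timeout_ms / 60000,               # configured limit, in minutes
        "elapsed_min": elapsed_ms / 60000,               # how long the collective ran
        "overrun_s": (elapsed_ms - timeout_ms) / 1000,   # time past the limit, in seconds
    }


# The message copied from the log above.
line = (
    "[Rank 1] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=1466105, OpType=BROADCAST, Timeout(ms)=1800000) "
    "ran for 1803435 milliseconds before timing out."
)
info = parse_watchdog_line(line)
print(info)
```

Read this way, rank 1 sat in a BROADCAST for the full 30-minute limit and was aborted about 3.4 seconds after it expired. One possible reading, which the log alone does not confirm, is that the other rank never reached the matching collective.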