Any help explaining what this error means would be greatly appreciated!
I run the following command:
python -m torch.distributed.launch --rdzv_endpoint=localhost:29400 --nproc_per_node=2 main.py
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/pre-compiled/python/3.7/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/LIMPQ/env/lib/python3.7/site-packages/tensorboard/summary/writer/event_file_writer.py", line 233, in run
    self._record_writer.write(data)
  File "/LIMPQ/env/lib/python3.7/site-packages/tensorboard/summary/writer/record_writer.py", line 40, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "/wsu/home/gn/gn75/gn7599/LIMPQ/env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 766, in write
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "/LIMPQ/env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 160, in append
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "/LIMPQ/env/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 166, in _write
    f.write(compatify(file_content))
OSError: [Errno 5] Input/output error
INFO - Training [8][ 2640/20019] Loss 3.074509 Top1 50.600142 Top5 74.356652 BatchTime 0.343744 LR 0.038725
INFO - --------------------------------------------------------------------------------------------------------------
INFO - Training [8][ 2660/20019] Loss 3.073985 Top1 50.612077 Top5 74.376762 BatchTime 0.343832 LR 0.038725
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1466105, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803435 milliseconds before timi$
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1466105, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803435 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 610 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 611) of binary: /LIMPQ/env/bin/python
Traceback (most recent call last):
  File "/pre-compiled/python/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/pre-compiled/python/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/LIMPQ/env/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/LIMPQ/env/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/LIMPQ/env/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/LIMPQ/env/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/LIMPQ/env/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/LIMPQ/env/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
====================================================
main.py FAILED
----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-30_17:45:05
  host      : vohost
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 611)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 611
====================================================
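
To make the numbers in the watchdog message easier to read, here is a small sketch that pulls them out of the log line and converts them to minutes. The helper name `parse_watchdog` is hypothetical and only for illustration; it is not part of PyTorch. It only does arithmetic on the message and says nothing about why the BROADCAST stalled.

```python
import re

# Watchdog message copied from the log above (rank 1).
line = (
    "[Rank 1] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=1466105, OpType=BROADCAST, Timeout(ms)=1800000) "
    "ran for 1803435 milliseconds before timing out."
)

def parse_watchdog(msg):
    """Hypothetical helper: extract rank, op type and timings from the message."""
    m = re.search(
        r"\[Rank (\d+)\].*OpType=(\w+), Timeout\(ms\)=(\d+)\) "
        r"ran for (\d+) milliseconds",
        msg,
    )
    if m is None:
        return None  # not a watchdog timeout line
    rank, op, timeout_ms, ran_ms = m.groups()
    timeout_ms, ran_ms = int(timeout_ms), int(ran_ms)
    return {
        "rank": int(rank),
        "op": op,
        "timeout_min": timeout_ms / 60000,          # configured timeout, in minutes
        "ran_min": ran_ms / 60000,                  # how long the collective waited
        "overrun_s": (ran_ms - timeout_ms) / 1000,  # time past the timeout, in seconds
    }

info = parse_watchdog(line)
print(info)
```

So the timeout in the message is 30 minutes, and rank 1 had been waiting on the BROADCAST for slightly longer than that when the watchdog aborted the process (hence the SIGABRT, exit code -6).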