I'm using Hugging Face Transformers to pretrain BERT with masked language modeling (`run_mlm.py`). When the input dataset grows beyond about 2 million examples, training fails with an NCCL collective-operation timeout; with smaller datasets the error does not occur. The same error also appears when individual input strings are very long. I suspect the fix is to raise the timeout threshold, but how do I set it?
ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809815 milliseconds before timing out.
Grouping texts in chunks of 128: 66%|██████▌ | 2290000/3483738 [30:10<16:51, 1179.89 examples/s]
Grouping texts in chunks of 128: 0%| | 0/3483738 [00:00<?, ? examples/s][E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
Grouping texts in chunks of 128: 66%|██████▌ | 2291000/3483738 [30:11<16:23, 1212.55 examples/s]
Grouping texts in chunks of 128: 66%|██████▌ | 2292000/3483738 [30:11<16:00, 1241.28 examples/s]
Grouping texts in chunks of 128: 66%|██████▌ | 2293000/3483738 [30:12<15:40, 1265.45 examples/s]
Grouping texts in chunks of 128: 66%|██████▌ | 2294000/3483738 [30:13<15:27, 1282.38 examples/s]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 45839 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 45840) of binary: /data/soft/anaconda3/envs/hu/bin/python3
Traceback (most recent call last):
File "/data/soft/anaconda3/envs/hu/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/soft/anaconda3/envs/hu/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/soft/anaconda3/envs/hu/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
main()
File "/data/soft/anaconda3/envs/hu/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/data/soft/anaconda3/envs/hu/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/data/soft/anaconda3/envs/hu/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/data/soft/anaconda3/envs/hu/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/soft/anaconda3/envs/hu/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
/data/show_train/mlm_200w_768/train/run_mlm.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-24_14:24:25
host : temporary-bigdata-offline-gpu-node03
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 45840)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 45840
======================================================
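What I think might work is raising the NCCL timeout passed to `torch.distributed.init_process_group` (its `timeout` parameter defaults to 30 minutes, which matches the `Timeout(ms)=1800000` in the log above). Below is a minimal sketch; the helper name `nccl_init_kwargs` is my own, and the 3-hour value is an arbitrary example, not a recommendation:

```python
import datetime

# Hypothetical helper (name is mine): build the kwargs we would pass to
# torch.distributed.init_process_group. The "backend" and "timeout"
# parameter names come from the PyTorch distributed API; raising
# "timeout" above the 30-minute default should keep the NCCL watchdog
# from killing ranks that sit idle while rank 0 preprocesses the data.
def nccl_init_kwargs(hours: float = 3.0) -> dict:
    return {
        "backend": "nccl",
        "timeout": datetime.timedelta(hours=hours),
    }

# In a training script one would then call, e.g.:
#   torch.distributed.init_process_group(**nccl_init_kwargs(3.0))
```

If your `transformers` version is recent enough to expose it, `run_mlm.py` may also accept `--ddp_timeout 10800` (seconds) via `TrainingArguments`, which sets the same value without touching the script. A separate thought: since the timeout fires during the "Grouping texts" preprocessing step, running that step only on the main process (so the other ranks don't wait on the collective for the whole 30 minutes) might address the root cause rather than just widening the window.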