I'm using Hugging Face Transformers to pretrain BERT with masked language modeling (`run_mlm.py`). When the input dataset grows beyond about 2 million examples, training fails with an NCCL collective-operation timeout; with smaller datasets the error does not occur. The same error also appears when individual input strings are very long. I suspect the fix is to raise the timeout threshold, but how do I set it?
ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809815 milliseconds before timing out.
Grouping texts in chunks of 128: 66%|██████▌ | 2290000/3483738 [30:10<16:51, 1179.89 examples/s]
Grouping texts in chunks of 128: 0%| | 0/3483738 [00:00<?, ? examples/s][E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
Grouping texts in chunks of 128: 66%|██████▌ | 2291000/3483738 [30:11<16:23, 1212.55 examples/s]
Grouping texts in chunks of 128: 66%|██████▌ | 2292000/3483738 [30:11<16:00, 1241.28 examples/s]
Grouping texts in chunks of 128: 66%|██████▌ | 2293000/3483738 [30:12<15:40, 1265.45 examples/s]
Grouping texts in chunks of 128: 66%|██████▌ | 2294000/3483738 [30:13<15:27, 1282.38 examples/s]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 45839 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 45840) of binary: /data/soft/anaconda3/envs/hu/bin/python3
Traceback (most recent call last):
File "/data/soft/anaconda3/envs/hu/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/soft/anaconda3/envs/hu/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/soft/anaconda3/envs/hu/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
main()
File "/data/soft/anaconda3/envs/hu/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/data/soft/anaconda3/envs/hu/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/data/soft/anaconda3/envs/hu/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/data/soft/anaconda3/envs/hu/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/soft/anaconda3/envs/hu/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
/data/show_train/mlm_200w_768/train/run_mlm.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-24_14:24:25
host : temporary-bigdata-offline-gpu-node03
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 45840)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 45840
======================================================
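What I think might work is raising the NCCL timeout passed to `torch.distributed.init_process_group` (its `timeout` parameter defaults to 30 minutes, which matches the `Timeout(ms)=1800000` in the log above). Below is a minimal sketch; the helper name `nccl_init_kwargs` is my own, and the 3-hour value is an arbitrary example, not a recommendation:

```python
import datetime

# Hypothetical helper (name is mine): build the kwargs we would pass to
# torch.distributed.init_process_group. The "backend" and "timeout"
# parameter names come from the PyTorch distributed API; raising
# "timeout" above the 30-minute default should keep the NCCL watchdog
# from killing ranks that sit idle while rank 0 preprocesses the data.
def nccl_init_kwargs(hours: float = 3.0) -> dict:
    return {
        "backend": "nccl",
        "timeout": datetime.timedelta(hours=hours),
    }

# In a training script one would then call, e.g.:
#   torch.distributed.init_process_group(**nccl_init_kwargs(3.0))
```

If your `transformers` version is recent enough to expose it, `run_mlm.py` may also accept `--ddp_timeout 10800` (seconds) via `TrainingArguments`, which sets the same value without touching the script. A separate thought: since the timeout fires during the "Grouping texts" preprocessing step, running that step only on the main process (so the other ranks don't wait on the collective for the whole 30 minutes) might address the root cause rather than just widening the window.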