I am currently using torchrun for training (this), and when running the script I hit an error with exitcode -7. I couldn't find any relevant information about this error on the internet, and the torch documentation doesn't explain what the code means either. The error message itself is not clear enough to understand the problem.
Can anyone explain what exitcode -7 means and what might be causing it?
I have ruled out the possibility that it’s due to insufficient resources.
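For context, my understanding is that torchrun reports a negative exitcode when a worker process is killed by a signal, so -7 should correspond to signal 7 (SIGBUS), which matches the traceback lines in the log below. This is just a minimal check of that mapping I did on my own (Linux only, relying on the returncode convention of Python's subprocess module; it is not part of train.sh):

import signal
import subprocess

# Signal number 7 on Linux is SIGBUS.
print(signal.Signals(7).name)  # SIGBUS

# A child process killed by a signal reports returncode -<signal number>,
# the same convention torchrun shows as "exitcode: -7".
proc = subprocess.run(
    ["python", "-c", "import os, signal; os.kill(os.getpid(), signal.SIGBUS)"]
)
print(proc.returncode)  # -7

So the question is really: what could be triggering the SIGBUS during training?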
log:
(vicuna) root@vicuna-696bd59b59-cplpn:/raid/minxiang83/Vicuna# sh train.sh
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/transformers/training_args.py:1388: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/transformers/training_args.py:1388: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.55s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.76s/it]
Loading data...
#train 872, #eval 18
Formatting inputs...Skip in lazy mode
Formatting inputs...Skip in lazy mode
wandb: Currently logged in as: 8311. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.4
wandb: Run data is saved locally in /raid/minxiang83/Vicuna/wandb/run-20230609_035606-awl8j1ko
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run dazzling-totem-42
wandb: ⭐️ View project at https://wandb.ai/8311/huggingface
wandb: 🚀 View run at https://wandb.ai/8311/huggingface/runs/awl8j1ko
0%| | 0/39 [00:00<?, ?it/s]vicuna-696bd59b59-cplpn:17421:17421 [0] NCCL INFO Bootstrap : Using eth0:192.168.200.188<0>
vicuna-696bd59b59-cplpn:17421:17421 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
vicuna-696bd59b59-cplpn:17421:17421 [0] NCCL INFO cudaDriverVersion 11060
NCCL version 2.14.3+cuda11.7
vicuna-696bd59b59-cplpn:17422:17422 [1] NCCL INFO cudaDriverVersion 11060
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.200.188<0>
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Using network Socket
vicuna-696bd59b59-cplpn:17422:17422 [1] NCCL INFO Bootstrap : Using eth0:192.168.200.188<0>
vicuna-696bd59b59-cplpn:17422:17422 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
vicuna-696bd59b59-cplpn:17422:17752 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
vicuna-696bd59b59-cplpn:17422:17752 [1] NCCL INFO NET/Socket : Using [0]eth0:192.168.200.188<0>
vicuna-696bd59b59-cplpn:17422:17752 [1] NCCL INFO Using network Socket
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000
vicuna-696bd59b59-cplpn:17422:17752 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
vicuna-696bd59b59-cplpn:17422:17752 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000,00000000
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Channel 00/04 : 0 1
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Channel 01/04 : 0 1
vicuna-696bd59b59-cplpn:17422:17752 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Channel 02/04 : 0 1
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Channel 03/04 : 0 1
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 17421) of binary: /root/miniconda3/envs/vicuna/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/vicuna/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
FastChat/fastchat/train/train_mem.py FAILED
-----------------------------------------------------
Failures:
[1]:
time : 2023-06-09_03:56:32
host : vicuna-696bd59b59-cplpn
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 17422)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 17422
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-06-09_03:56:32
host : vicuna-696bd59b59-cplpn
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 17421)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 17421
=====================================================