What is the meaning of "exitcode" in torchrun?

I am currently using torchrun for training (this), and when running the script I encountered an error with exitcode -7. I couldn’t find any relevant information about this error on the internet, and there is no explanation of the code’s meaning in the torch documentation either. The error message is not clear enough to understand the problem.

Is there someone here who can explain the issue behind exitcode -7?

I have ruled out the possibility that it’s due to insufficient resources.

log:

(vicuna) root@vicuna-696bd59b59-cplpn:/raid/minxiang83/Vicuna# sh train.sh 
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/transformers/training_args.py:1388: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/transformers/training_args.py:1388: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.55s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.76s/it]
Loading data...
#train 872, #eval 18
Formatting inputs...Skip in lazy mode
Formatting inputs...Skip in lazy mode
wandb: Currently logged in as: 8311. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.4
wandb: Run data is saved locally in /raid/minxiang83/Vicuna/wandb/run-20230609_035606-awl8j1ko
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run dazzling-totem-42
wandb: ⭐️ View project at https://wandb.ai/8311/huggingface
wandb: 🚀 View run at https://wandb.ai/8311/huggingface/runs/awl8j1ko
  0%|                                                                                                                                                                                                | 0/39 [00:00<?, ?it/s]vicuna-696bd59b59-cplpn:17421:17421 [0] NCCL INFO Bootstrap : Using eth0:192.168.200.188<0>
vicuna-696bd59b59-cplpn:17421:17421 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
vicuna-696bd59b59-cplpn:17421:17421 [0] NCCL INFO cudaDriverVersion 11060
NCCL version 2.14.3+cuda11.7
vicuna-696bd59b59-cplpn:17422:17422 [1] NCCL INFO cudaDriverVersion 11060
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.200.188<0>
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Using network Socket
vicuna-696bd59b59-cplpn:17422:17422 [1] NCCL INFO Bootstrap : Using eth0:192.168.200.188<0>
vicuna-696bd59b59-cplpn:17422:17422 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
vicuna-696bd59b59-cplpn:17422:17752 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
vicuna-696bd59b59-cplpn:17422:17752 [1] NCCL INFO NET/Socket : Using [0]eth0:192.168.200.188<0>
vicuna-696bd59b59-cplpn:17422:17752 [1] NCCL INFO Using network Socket
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000
vicuna-696bd59b59-cplpn:17422:17752 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
vicuna-696bd59b59-cplpn:17422:17752 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000,00000000
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Channel 00/04 :    0   1
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Channel 01/04 :    0   1
vicuna-696bd59b59-cplpn:17422:17752 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Channel 02/04 :    0   1
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Channel 03/04 :    0   1
vicuna-696bd59b59-cplpn:17421:17751 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 17421) of binary: /root/miniconda3/envs/vicuna/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/vicuna/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
FastChat/fastchat/train/train_mem.py FAILED
-----------------------------------------------------
Failures:
[1]:
  time      : 2023-06-09_03:56:32
  host      : vicuna-696bd59b59-cplpn
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 17422)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 17422
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-09_03:56:32
  host      : vicuna-696bd59b59-cplpn
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 17421)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 17421
=====================================================

Signal 7 (SIGBUS) is a bus error, described here, which usually indicates “that a process is trying to access memory that the CPU cannot physically address”.
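
To make that concrete, here is a minimal Linux-only sketch of one classic way a process earns a SIGBUS (the path is a placeholder, and the script crashes by design): a memory-mapped page whose backing storage has vanished has no physical memory behind it. Running out of space on /dev/shm while writing to a mapping produces the same kind of fault.

    import mmap
    import os

    path = "/tmp/sigbus_demo"  # placeholder scratch file
    with open(path, "wb") as f:
        f.write(b"\0" * 8192)

    f = open(path, "r+b")
    m = mmap.mmap(f.fileno(), 8192)  # map the whole 8 KB file

    os.truncate(path, 0)  # shrink the file: the mapped pages lose their backing
    m[0] = 1              # touching such a page raises Signal 7 (SIGBUS)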

Hi @ptrblck ,
Thank you for your response. I would like to inquire further: What could be the reasons for being unable to access the environment within Docker? Do you have any suggestions for resolving the issue? Thank you in advance!

I’m unsure if I understand the question correctly, but the error is not pointing towards a lack of permission to access an environment, but towards a memory violation.
You could try to use `num_workers=0` in your DataLoader and check if this helps. If not, you might need to use e.g. gdb to check the stack trace, which might allow you to isolate the issue further.
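
As a minimal sketch of that first suggestion (the dataset here is a stand-in, not the one from the training script above):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # stand-in dataset; substitute the actual training dataset
    dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))

    # num_workers=0 keeps data loading in the main process, so no worker
    # subprocesses pass batches through shared memory (/dev/shm)
    loader = DataLoader(dataset, batch_size=8, num_workers=0)

If the bus error disappears with `num_workers=0`, the shared-memory path the loader workers use to hand batches to the main process is a likely culprit.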


Hi @ptrblck ,
Thank you for the additional clarification. I have found the reason: I needed to configure the shm-size of my Docker container to resolve the bus error.
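
For anyone who lands here with the same symptom: Docker’s default /dev/shm is only 64 MB, which PyTorch’s DataLoader workers and NCCL’s shared-memory transport can exhaust quickly. A sketch of the fix, with the size and image name as placeholders for your own setup:

    docker run --gpus all --shm-size=16g my-training-image

You can verify the effective size inside the container with `df -h /dev/shm`. (On Kubernetes, which the pod-style hostname above suggests, the equivalent workaround is mounting an emptyDir volume with medium: Memory at /dev/shm.)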

Great to hear that!
Did you narrow it down by reducing the number of workers? If so, this would have been my next suggestion :wink:
