Keep getting ChildFailedError in distributed setup

Hi everyone, and thanks in advance for any help.

I’m running a slightly modified run_clm.py script with a varying number of A100 GPUs (4-8) on a single node, and I keep getting ChildFailedError right after training/evaluation ends.
I train/evaluate GPT2 (smallest model) on the OpenWebText dataset.

An example of how I run my evaluation code from a shell script is as follows:

GPU=1,2,3,4,5
export TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO
export CUDA_VISIBLE_DEVICES=$GPU

torchrun \
  --standalone \
  --nnodes=1 \
  --nproc_per_node=${NUM_GPU} \
  run_clm.py \
  --model_name_or_path ${MODEL} \
  --dataset_name ${DS_NAME} \
  --preprocessing_num_workers 16 \
  --logging_steps 5000 \
  --save_steps ${SAVE_STEPS} \
  --do_eval \
  --per_device_eval_batch_size ${EVAL_BATCH} \
  --seed ${RANDOM} \
  --evaluation_strategy steps \
  --logging_dir ${OUTPUT_DIR} \
  --output_dir ${OUTPUT_DIR} \
  --overwrite_output_dir \
  --ddp_timeout 324000

And I’m getting the following error:

100%|██████████| 2209/2209 [39:03<00:00, 1.06s/it]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4041 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4042 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4043 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4045 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 4044) of binary: /venv/bin/python3
Traceback (most recent call last):
  File "/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-12-21_21:57:29
host : december-ds-2h4b6-5hpkj
rank : 3 (local_rank: 3)
exitcode : -9 (pid: 4044)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 4044
============================================================

Full log (after truncating unnecessary tqdm output, etc.):

12/21/2022 20:37:39 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 20:37:39 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 20:37:39 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 20:37:39 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=324000,
debug=,
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=5000,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=,
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=passive,
log_on_each_node=True,
logging_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/test_saved_data_eval,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=5000,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_hf,
optim_args=None,
output_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/test_saved_data_eval,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[‘wandb’],
resume_from_checkpoint=None,
run_name=openwebtext_inference,
save_on_each_node=False,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=6298,
sharded_ddp=,
skip_memory_metrics=True,
tf32=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
12/21/2022 20:37:39 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 20:37:39 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
[INFO|configuration_utils.py:654] 2022-12-21 20:37:40,050 >> loading configuration file config.json from cache at /store/.cache/huggingface/hub/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file vocab.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/vocab.json
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file merges.txt from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/merges.txt
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file tokenizer.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/tokenizer.json
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file tokenizer_config.json from cache at None
[INFO|configuration_utils.py:654] 2022-12-21 20:37:41,351 >> loading configuration file config.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|modeling_utils.py:2204] 2022-12-21 20:37:44,384 >> loading weights file pytorch_model.bin from cache at /store/.cache/huggingface/hub/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/pytorch_model.bin
[INFO|modeling_utils.py:2708] 2022-12-21 20:37:50,669 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
======================================================
[INFO|modeling_utils.py:2716] 2022-12-21 20:37:50,669 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
======================================================
december-ds-2h4b6-5hpkj:4041:4041 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
december-ds-2h4b6-5hpkj:4042:4042 [1] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4044:4044 [3] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4043:4043 [2] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4045:4045 [4] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4042:4042 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
======================================================
december-ds-2h4b6-5hpkj:4042:4042 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
december-ds-2h4b6-5hpkj:4044:4044 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
december-ds-2h4b6-5hpkj:4042:4042 [1] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4042:4042 [1] NCCL INFO Using network Socket
december-ds-2h4b6-5hpkj:4043:4043 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
december-ds-2h4b6-5hpkj:4045:4045 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
======================================================
december-ds-2h4b6-5hpkj:4044:4044 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
======================================================
december-ds-2h4b6-5hpkj:4045:4045 [4] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
======================================================
december-ds-2h4b6-5hpkj:4043:4043 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
december-ds-2h4b6-5hpkj:4043:4043 [2] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4043:4043 [2] NCCL INFO Using network Socket
december-ds-2h4b6-5hpkj:4044:4044 [3] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4045:4045 [4] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4044:4044 [3] NCCL INFO Using network Socket
december-ds-2h4b6-5hpkj:4045:4045 [4] NCCL INFO Using network Socket
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Setting affinity for GPU 4 to ffff,fff00000,00ffffff,f0000000
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Trees [0] -1/-1/-1->4->3 [1] -1/-1/-1->4->3
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Setting affinity for GPU 5 to ffff,fff00000,00ffffff,f0000000
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Setting affinity for GPU 1 to 0fffff,ff000000,0fffffff
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Setting affinity for GPU 2 to 0fffff,ff000000,0fffffff
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Setting affinity for GPU 3 to 0fffff,ff000000,0fffffff
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Channel 00 : 2[3e000] → 3[88000] via direct shared memory
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Channel 00 : 4[89000] → 0[1b000] via direct shared memory
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Channel 01 : 2[3e000] → 3[88000] via direct shared memory
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Channel 01 : 4[89000] → 0[1b000] via direct shared memory
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Channel 00 : 0[1b000] → 1[3d000] via direct shared memory
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Channel 00 : 1[3d000] → 2[3e000] via P2P/IPC
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Channel 01 : 0[1b000] → 1[3d000] via direct shared memory
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Channel 00 : 3[88000] → 4[89000] via P2P/IPC
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Channel 01 : 1[3d000] → 2[3e000] via P2P/IPC
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Channel 01 : 3[88000] → 4[89000] via P2P/IPC
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Channel 00 : 4[89000] → 3[88000] via P2P/IPC
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Channel 01 : 4[89000] → 3[88000] via P2P/IPC
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Channel 00 : 1[3d000] → 0[1b000] via direct shared memory
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Channel 01 : 1[3d000] → 0[1b000] via direct shared memory
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Channel 00 : 3[88000] → 2[3e000] via direct shared memory
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Channel 01 : 3[88000] → 2[3e000] via direct shared memory
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Channel 00 : 2[3e000] → 1[3d000] via P2P/IPC
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Channel 01 : 2[3e000] → 1[3d000] via P2P/IPC
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO comm 0x7f97e8002fb0 rank 1 nranks 5 cudaDev 1 busId 3d000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO comm 0x7f5d80002fb0 rank 2 nranks 5 cudaDev 2 busId 3e000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO comm 0x7fee20002fb0 rank 3 nranks 5 cudaDev 3 busId 88000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO comm 0x7f21a4002fb0 rank 0 nranks 5 cudaDev 0 busId 1b000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO comm 0x7f7078002fb0 rank 4 nranks 5 cudaDev 4 busId 89000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO Launch mode Parallel
12/21/2022 20:56:29 - INFO - main - *** Evaluate ***
[INFO|trainer.py:703] 2022-12-21 20:56:29,129 >> The following columns in the evaluation set don’t have a corresponding argument in GPT2LMHeadModel.forward and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by GPT2LMHeadModel.forward, you can safely ignore this message.
[INFO|trainer.py:2944] 2022-12-21 20:56:29,133 >> ***** Running Evaluation *****
[INFO|trainer.py:2946] 2022-12-21 20:56:29,133 >> Num examples = 88340
[INFO|trainer.py:2949] 2022-12-21 20:56:29,133 >> Batch size = 8
0%| | 0/2209 [00:00<?, ?it/s]
0%| | 2/2209 [00:00<18:23, 2.00it/s]
0%| | 3/2209 [00:02<26:56, 1.36it/s]
0%| | 4/2209 [00:03<31:30, 1.17it/s]


100%|██████████| 2209/2209 [39:03<00:00, 1.06s/it]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4041 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4042 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4043 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4045 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 4044) of binary: /venv/bin/python3
Traceback (most recent call last):
  File "/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-12-21_21:57:29
host : december-ds-2h4b6-5hpkj
rank : 3 (local_rank: 3)
exitcode : -9 (pid: 4044)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 4044
============================================================

Versions:

According to pip freeze --local:

torch==1.10.2+cu113
torchaudio==0.10.2+cu113
torchvision==0.11.3+cu113
nvidia-ml-py==11.495.46
multiprocess==0.70.14
packaging==21.3
transformers==4.25.1
huggingface-hub==0.11.1

absl-py==1.0.0
accelerate==0.15.0
aiohttp==3.8.3
aiosignal==1.3.1
async-timeout==4.0.2
attrs==22.1.0
blessed==1.19.1
cachetools==5.0.0
certifi==2021.10.8
charset-normalizer==2.0.12
click==8.1.3
datasets==2.7.1
deepspeed==0.6.0
dill==0.3.6
docker-pycreds==0.4.0
evaluate==0.4.0
fairscale==0.4.6
filelock==3.8.2
frozenlist==1.3.3
fsspec==2022.11.0
gitdb==4.0.10
GitPython==3.1.29
google-auth==2.6.0
google-auth-oauthlib==0.4.6
gpustat==1.0.0
grpcio==1.44.0
hjson==3.0.2
idna==3.3
importlib-metadata==4.11.2
joblib==1.2.0
Markdown==3.3.6
model-compression-research @ file:///
multidict==6.0.3
ninja==1.10.2.3
nltk==3.8
numpy==1.22.3
oauthlib==3.2.0
pandas==1.5.2
pathtools==0.1.2
Pillow==9.0.1
pkg_resources==0.0.0
promise==2.3
protobuf==3.19.4
psutil==5.9.0
py-cpuinfo==8.0.0
pyarrow==10.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==3.0.7
python-dateutil==2.8.2
pytz==2022.7
PyYAML==6.0
regex==2022.10.31
requests==2.27.1
requests-oauthlib==1.3.1
responses==0.18.0
rsa==4.8
scikit-learn==1.2.0
scipy==1.9.3
sentencepiece==0.1.97
sentry-sdk==1.12.0
setproctitle==1.3.2
shortuuid==1.0.11
six==1.16.0
sklearn==0.0.post1
smmap==5.0.0
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
threadpoolctl==3.1.0
tokenizers==0.13.2
tqdm==4.63.0
typing_extensions==4.1.1
urllib3==1.26.13
wandb==0.13.7
wcwidth==0.2.5
Werkzeug==2.0.3
xxhash==3.1.0
yarl==1.8.2
zipp==3.7.0

Notes:

  1. The error occurs both in training and in evaluation.
  2. To rule out a timeout, I deliberately set a very high timeout value (ddp_timeout=324000).
  3. I tried running with both torchrun and torch.distributed.launch and hit the same issue.
  4. The number of samples in my training/eval set doesn’t matter; the issue remains.
  5. I track my memory usage, and OOM is not the case here (kinda wish it was). A simplified sketch of the per-rank logging I use is shown right after this list.
  6. The error occurs only in the distributed setup. When not using distributed training, or when using it with a single GPU, the problem doesn’t appear.
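Roughly how I log memory per rank (a simplified sketch, not my exact code; psutil and torch are already installed in this environment):

import os

import psutil
import torch


def log_memory(tag: str) -> None:
    # Rank set by torchrun, so lines from different processes can be told apart.
    rank = int(os.environ.get("RANK", 0))
    # Host-side memory of this process and remaining available host RAM.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    host_avail_gb = psutil.virtual_memory().available / 1024 ** 3
    # GPU memory allocated by this process on its device.
    gpu_alloc_gb = torch.cuda.memory_allocated() / 1024 ** 3 if torch.cuda.is_available() else 0.0
    print(
        f"[rank {rank}] {tag}: rss={rss_gb:.2f} GiB, "
        f"host_available={host_avail_gb:.2f} GiB, gpu_allocated={gpu_alloc_gb:.2f} GiB",
        flush=True,
    )


# Called e.g. before and after trainer.evaluate() in my modified run_clm.py:
# log_memory("before eval"); metrics = trainer.evaluate(); log_memory("after eval")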

I’d really appreciate any help on this, since my research is stuck until I figure it out :pray:

Hey everyone,
does anyone have an idea about this?

Hey guys,

I tried upgrading my torch version to 1.12.1+cu113.
The error remains, but there are additional details in the warnings that I still don’t understand and that may be connected to my issue:

[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:29400.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:29400 on [localhost]:51966.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:29400 on [localhost]:51980.
[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:40069.
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43714.
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43722.
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43734.
[I ProcessGroupNCCL.cpp:587] [Rank 4] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 324000000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 4] NCCL watchdog thread started!
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43726.
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43742.
[I ProcessGroupNCCL.cpp:751] [Rank 3] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 3] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 324000000
USE_HIGH_PRIORITY_STREAM: 0
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43754.
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43758.
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43770.
[I ProcessGroupNCCL.cpp:587] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 324000000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 1] NCCL watchdog thread started!
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43776.
[I ProcessGroupNCCL.cpp:587] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 324000000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 2] NCCL watchdog thread started!
[I socket.cpp:725] [c10d] The client socket has connected to [::ffff:10.42.91.117]:40069 on [::ffff:10.42.91.117]:43778.
[I ProcessGroupNCCL.cpp:587] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 324000000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 0] NCCL watchdog thread started!
12/24/2022 17:09:12 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
12/24/2022 17:09:12 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False
12/24/2022 17:09:12 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
12/24/2022 17:09:12 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
12/24/2022 17:09:12 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
12/24/2022 17:09:12 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=324000,
debug=,
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=5000,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=,
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=passive,
log_on_each_node=True,
logging_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/gpt2/distributed_inference,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=5000,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_hf,
optim_args=None,
output_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/gpt2/distributed_inference,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=4,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[‘wandb’],
resume_from_checkpoint=None,
run_name=openwebtext-distributed-inference,
save_on_each_node=False,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=2520,
sharded_ddp=,
skip_memory_metrics=True,
tf32=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
[INFO|configuration_utils.py:654] 2022-12-24 17:09:12,874 >> loading configuration file config.json from cache at /store/.cache/huggingface/hub/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|configuration_utils.py:706] 2022-12-24 17:09:12,875 >> Model config GPT2Config {
“_name_or_path”: “gpt2”,
“activation_function”: “gelu_new”,
“architectures”: [
“GPT2LMHeadModel”
],
“attn_pdrop”: 0.1,
“bos_token_id”: 50256,
“embd_pdrop”: 0.1,
“eos_token_id”: 50256,
“initializer_range”: 0.02,
“layer_norm_epsilon”: 1e-05,
“model_type”: “gpt2”,
“n_ctx”: 1024,
“n_embd”: 768,
“n_head”: 12,
“n_inner”: null,
“n_layer”: 12,
“n_positions”: 1024,
“reorder_and_upcast_attn”: false,
“resid_pdrop”: 0.1,
“scale_attn_by_inverse_layer_idx”: false,
“scale_attn_weights”: true,
“summary_activation”: null,
“summary_first_dropout”: 0.1,
“summary_proj_to_labels”: true,
“summary_type”: “cls_index”,
“summary_use_proj”: true,
“task_specific_params”: {
“text-generation”: {
“do_sample”: true,
“max_length”: 50
}
},
“transformers_version”: “4.25.1”,
“use_cache”: true,
“vocab_size”: 50257
}
[INFO|tokenization_auto.py:449] 2022-12-24 17:09:13,190 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
[INFO|configuration_utils.py:654] 2022-12-24 17:09:13,508 >> loading configuration file config.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|configuration_utils.py:706] 2022-12-24 17:09:13,509 >> Model config GPT2Config {
“_name_or_path”: “gpt2”,
“activation_function”: “gelu_new”,
“architectures”: [
“GPT2LMHeadModel”
],
“attn_pdrop”: 0.1,
“bos_token_id”: 50256,
“embd_pdrop”: 0.1,
“eos_token_id”: 50256,
“initializer_range”: 0.02,
“layer_norm_epsilon”: 1e-05,
“model_type”: “gpt2”,
“n_ctx”: 1024,
“n_embd”: 768,
“n_head”: 12,
“n_inner”: null,
“n_layer”: 12,
“n_positions”: 1024,
“reorder_and_upcast_attn”: false,
“resid_pdrop”: 0.1,
“scale_attn_by_inverse_layer_idx”: false,
“scale_attn_weights”: true,
“summary_activation”: null,
“summary_first_dropout”: 0.1,
“summary_proj_to_labels”: true,
“summary_type”: “cls_index”,
“summary_use_proj”: true,
“task_specific_params”: {
“text-generation”: {
“do_sample”: true,
“max_length”: 50
}
},
“transformers_version”: “4.25.1”,
“use_cache”: true,
“vocab_size”: 50257
}
[INFO|tokenization_utils_base.py:1799] 2022-12-24 17:09:14,133 >> loading file vocab.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/vocab.json
[INFO|tokenization_utils_base.py:1799] 2022-12-24 17:09:14,133 >> loading file merges.txt from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/merges.txt
[INFO|tokenization_utils_base.py:1799] 2022-12-24 17:09:14,133 >> loading file tokenizer.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/tokenizer.json
[INFO|tokenization_utils_base.py:1799] 2022-12-24 17:09:14,133 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-24 17:09:14,133 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-24 17:09:14,133 >> loading file tokenizer_config.json from cache at None
[INFO|configuration_utils.py:654] 2022-12-24 17:09:14,133 >> loading configuration file config.json from cache at .cache/datasets/processed/openwebtext/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|configuration_utils.py:706] 2022-12-24 17:09:14,134 >> Model config GPT2Config {
“_name_or_path”: “gpt2”,
“activation_function”: “gelu_new”,
“architectures”: [
“GPT2LMHeadModel”
],
“attn_pdrop”: 0.1,
“bos_token_id”: 50256,
“embd_pdrop”: 0.1,
“eos_token_id”: 50256,
“initializer_range”: 0.02,
“layer_norm_epsilon”: 1e-05,
“model_type”: “gpt2”,
“n_ctx”: 1024,
“n_embd”: 768,
“n_head”: 12,
“n_inner”: null,
“n_layer”: 12,
“n_positions”: 1024,
“reorder_and_upcast_attn”: false,
“resid_pdrop”: 0.1,
“scale_attn_by_inverse_layer_idx”: false,
“scale_attn_weights”: true,
“summary_activation”: null,
“summary_first_dropout”: 0.1,
“summary_proj_to_labels”: true,
“summary_type”: “cls_index”,
“summary_use_proj”: true,
“task_specific_params”: {
“text-generation”: {
“do_sample”: true,
“max_length”: 50
}
},
“transformers_version”: “4.25.1”,
“use_cache”: true,
“vocab_size”: 50257
}
[INFO|modeling_utils.py:2204] 2022-12-24 17:09:17,271 >> loading weights file pytorch_model.bin from cache at /store/.cache/huggingface/hub/models–gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/pytorch_model.bin
[I ProcessGroupNCCL.cpp:2012] Rank 4 using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[I ProcessGroupNCCL.cpp:2012] Rank 3 using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[I ProcessGroupNCCL.cpp:2012] Rank 2 using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[INFO|modeling_utils.py:2708] 2022-12-24 17:09:19,025 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:2716] 2022-12-24 17:09:19,026 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
/venv/lib/python3.8/site-packages/datasets/dataset_dict.py:1241: FutureWarning: ‘fs’ was is deprecated in favor of ‘storage_options’ in version 2.8.0 and will be removed in 3.0.0.
You can remove this warning by passing ‘storage_options=fs.storage_options’ instead.
warnings.warn(
[I ProcessGroupNCCL.cpp:2012] Rank 1 using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[I ProcessGroupNCCL.cpp:2012] Rank 0 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
distributed-4dc7b-wvh75:282:282 [0] NCCL INFO Bootstrap : Using eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:282:282 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
distributed-4dc7b-wvh75:282:282 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
distributed-4dc7b-wvh75:282:282 [0] NCCL INFO NET/Socket : Using [0]eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:282:282 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
distributed-4dc7b-wvh75:286:286 [4] NCCL INFO Bootstrap : Using eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:284:284 [2] NCCL INFO Bootstrap : Using eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:285:285 [3] NCCL INFO Bootstrap : Using eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:283:283 [1] NCCL INFO Bootstrap : Using eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:286:286 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
distributed-4dc7b-wvh75:286:286 [4] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
distributed-4dc7b-wvh75:283:283 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
distributed-4dc7b-wvh75:286:286 [4] NCCL INFO NET/Socket : Using [0]eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:286:286 [4] NCCL INFO Using network Socket
distributed-4dc7b-wvh75:283:283 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
distributed-4dc7b-wvh75:283:283 [1] NCCL INFO NET/Socket : Using [0]eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:283:283 [1] NCCL INFO Using network Socket
distributed-4dc7b-wvh75:285:285 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
distributed-4dc7b-wvh75:285:285 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
distributed-4dc7b-wvh75:285:285 [3] NCCL INFO NET/Socket : Using [0]eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:285:285 [3] NCCL INFO Using network Socket
distributed-4dc7b-wvh75:284:284 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
distributed-4dc7b-wvh75:284:284 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
distributed-4dc7b-wvh75:284:284 [2] NCCL INFO NET/Socket : Using [0]eth0:10.42.91.117<0>
distributed-4dc7b-wvh75:284:284 [2] NCCL INFO Using network Socket
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Setting affinity for GPU 1 to 0fffff,ff000000,0fffffff
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Setting affinity for GPU 3 to 0fffff,ff000000,0fffffff
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Setting affinity for GPU 2 to 0fffff,ff000000,0fffffff
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Trees [0] -1/-1/-1->4->3 [1] -1/-1/-1->4->3
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Setting affinity for GPU 4 to ffff,fff00000,00ffffff,f0000000
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Setting affinity for GPU 5 to ffff,fff00000,00ffffff,f0000000
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Channel 00 : 2[3e000] → 3[88000] via direct shared memory
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Channel 00 : 4[89000] → 0[1b000] via direct shared memory
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Channel 01 : 2[3e000] → 3[88000] via direct shared memory
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Channel 01 : 4[89000] → 0[1b000] via direct shared memory
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Channel 00 : 1[3d000] → 2[3e000] via P2P/IPC
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Channel 01 : 1[3d000] → 2[3e000] via P2P/IPC
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Channel 00 : 3[88000] → 4[89000] via P2P/IPC
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Channel 01 : 3[88000] → 4[89000] via P2P/IPC
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Channel 00 : 0[1b000] → 1[3d000] via direct shared memory
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Channel 01 : 0[1b000] → 1[3d000] via direct shared memory
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Connected all rings
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Connected all rings
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Channel 00 : 3[88000] → 2[3e000] via direct shared memory
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Connected all rings
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Connected all rings
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Channel 01 : 3[88000] → 2[3e000] via direct shared memory
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Channel 00 : 4[89000] → 3[88000] via P2P/IPC
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Connected all rings
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Channel 01 : 4[89000] → 3[88000] via P2P/IPC
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO Connected all trees
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Channel 00 : 1[3d000] → 0[1b000] via direct shared memory
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Channel 01 : 1[3d000] → 0[1b000] via direct shared memory
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Channel 00 : 2[3e000] → 1[3d000] via P2P/IPC
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Channel 01 : 2[3e000] → 1[3d000] via P2P/IPC
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO Connected all trees
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO Connected all trees
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO Connected all trees
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO Connected all trees
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
distributed-4dc7b-wvh75:285:382 [3] NCCL INFO comm 0x7fc5c80030d0 rank 3 nranks 5 cudaDev 3 busId 88000 - Init COMPLETE
distributed-4dc7b-wvh75:283:380 [1] NCCL INFO comm 0x7f75e80030d0 rank 1 nranks 5 cudaDev 1 busId 3d000 - Init COMPLETE
distributed-4dc7b-wvh75:284:383 [2] NCCL INFO comm 0x7f6f700030d0 rank 2 nranks 5 cudaDev 2 busId 3e000 - Init COMPLETE
distributed-4dc7b-wvh75:286:381 [4] NCCL INFO comm 0x7f150c0030d0 rank 4 nranks 5 cudaDev 4 busId 89000 - Init COMPLETE
distributed-4dc7b-wvh75:282:379 [0] NCCL INFO comm 0x7f53ec0030d0 rank 0 nranks 5 cudaDev 0 busId 1b000 - Init COMPLETE
[I ProcessGroupNCCL.cpp:1196] NCCL_DEBUG: INFO
distributed-4dc7b-wvh75:282:282 [0] NCCL INFO Launch mode Parallel
/venv/lib/python3.8/site-packages/datasets/dataset_dict.py:1241: FutureWarning: ‘fs’ was is deprecated in favor of ‘storage_options’ in version 2.8.0 and will be removed in 3.0.0.
You can remove this warning by passing ‘storage_options=fs.storage_options’ instead.
warnings.warn(
/venv/lib/python3.8/site-packages/datasets/dataset_dict.py:1241: FutureWarning: ‘fs’ was is deprecated in favor of ‘storage_options’ in version 2.8.0 and will be removed in 3.0.0.
You can remove this warning by passing ‘storage_options=fs.storage_options’ instead.
warnings.warn(
/venv/lib/python3.8/site-packages/datasets/dataset_dict.py:1241: FutureWarning: ‘fs’ was is deprecated in favor of ‘storage_options’ in version 2.8.0 and will be removed in 3.0.0.
You can remove this warning by passing ‘storage_options=fs.storage_options’ instead.
warnings.warn(
/venv/lib/python3.8/site-packages/datasets/dataset_dict.py:1241: FutureWarning: ‘fs’ was is deprecated in favor of ‘storage_options’ in version 2.8.0 and will be removed in 3.0.0.
You can remove this warning by passing ‘storage_options=fs.storage_options’ instead.
warnings.warn(
12/24/2022 17:12:10 - INFO - main - *** Evaluate ***
[INFO|trainer.py:703] 2022-12-24 17:12:10,400 >> The following columns in the evaluation set don’t have a corresponding argument in GPT2LMHeadModel.forward and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by GPT2LMHeadModel.forward, you can safely ignore this message.
[INFO|trainer.py:2944] 2022-12-24 17:12:10,405 >> ***** Running Evaluation *****
[INFO|trainer.py:2946] 2022-12-24 17:12:10,405 >> Num examples = 88340
[INFO|trainer.py:2949] 2022-12-24 17:12:10,405 >> Batch size = 4
0%| | 0/4417 [00:00<?, ?it/s]
0%| | 2/4417 [00:00<21:13, 3.47it/s]


100%|█████████▉| 4416/4417 [43:27<00:00, 1.69it/s]
100%|██████████| 4417/4417 [43:28<00:00, 1.69it/s]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 282 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 284 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 285 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 286 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 283) of binary: /venv/bin/python3
Traceback (most recent call last):
  File "/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-12-24_17:58:51
host : distributed-4dc7b-wvh75
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 283)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 283
============================================================

I hope someone can help me locate the issue :pray:

cc @d4l3k for TorchElastic questions

Hey @IdoAmit198, IIUC, the child failure indicates the training process crashed, and the SIGKILL was because TorchElastic detected a failure on a peer process and then killed the other training processes. It would be helpful to narrow down which part of the training code caused the original failure. Is it possible to add logs to figure out which line caused the failure?
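
For example, a rough sketch of what I mean (assuming run_clm.py has a main() entrypoint like the upstream transformers example; adapt to your actual script): wrap the entrypoint with TorchElastic’s record decorator so uncaught Python exceptions get written to the error_file that the failure report currently shows as <N/A>, and add coarse progress logs to bracket where a rank was when it received the SIGKILL:

import logging
import os

# record is part of torch.distributed.elastic and writes uncaught Python exceptions
# to the error file that torchrun/TorchElastic includes in its failure report.
from torch.distributed.elastic.multiprocessing.errors import record

logger = logging.getLogger(__name__)


@record
def main():
    rank = int(os.environ.get("RANK", 0))
    logger.warning("rank %d: starting evaluation", rank)
    # ... existing run_clm.py logic: load model/tokenizer, build the Trainer ...
    # metrics = trainer.evaluate()
    logger.warning("rank %d: evaluation finished, logging/saving metrics", rank)
    # trainer.log_metrics("eval", metrics); trainer.save_metrics("eval", metrics)
    logger.warning("rank %d: main() returning", rank)


if __name__ == "__main__":
    main()

A hard SIGKILL itself won’t produce a Python traceback, but log lines like these would at least narrow down whether the rank died inside the evaluation loop or afterwards, while metrics were being gathered and written.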