Hi everyone, and thanks in advance for any help.
I'm running a slightly modified run_clm.py script with a varying number of A100 GPUs (4-8) on a single node, and I keep getting a ChildFailedError right after training/evaluation ends.
I train/evaluate GPT2 (smallest model) on the OpenWebText dataset.
An example of how I run my evaluation code from a shell script is as follows:
GPU=1,2,3,4,5
export TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO
export CUDA_VISIBLE_DEVICES=$GPU

torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=${NUM_GPU} \
    run_clm.py \
    --model_name_or_path ${MODEL} \
    --dataset_name ${DS_NAME} \
    --preprocessing_num_workers 16 \
    --logging_steps 5000 \
    --save_steps ${SAVE_STEPS} \
    --do_eval \
    --per_device_eval_batch_size ${EVAL_BATCH} \
    --seed ${RANDOM} \
    --evaluation_strategy steps \
    --logging_dir ${OUTPUT_DIR} \
    --output_dir ${OUTPUT_DIR} \
    --overwrite_output_dir \
    --ddp_timeout 324000
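For completeness, the variables referenced in the command are defined earlier in the script. Reconstructing them from the TrainingArguments dump in the full log below (this is a sketch of plausible values, not the exact script):

```bash
# Sketch of the variable definitions used by the command above,
# reconstructed from the TrainingArguments dump in the log below.
MODEL=gpt2
DS_NAME=openwebtext
SAVE_STEPS=1000
EVAL_BATCH=8
OUTPUT_DIR=.cache/results/GPT2_Compression/baseline_results/OpenWebText/test_saved_data_eval
NUM_GPU=5   # matches the five devices listed in CUDA_VISIBLE_DEVICES
```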
And I get the following error:
100%|██████████| 2209/2209 [39:03<00:00, 1.06s/it]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4041 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4042 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4043 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4045 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 4044) of binary: /venv/bin/python3
Traceback (most recent call last):
File "/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-12-21_21:57:29
host : december-ds-2h4b6-5hpkj
rank : 3 (local_rank: 3)
exitcode : -9 (pid: 4044)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 4044
============================================================
Full log (after truncating unnecessary output from tqdm, etc.):
12/21/2022 20:37:39 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 20:37:39 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 20:37:39 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 20:37:39 - INFO - main - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=324000,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=5000,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=passive,
log_on_each_node=True,
logging_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/test_saved_data_eval,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=5000,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_hf,
optim_args=None,
output_dir=.cache/results/GPT2_Compression/baseline_results/OpenWebText/test_saved_data_eval,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=openwebtext_inference,
save_on_each_node=False,
save_steps=1000,
save_strategy=steps,
save_total_limit=None,
seed=6298,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
12/21/2022 20:37:39 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False
12/21/2022 20:37:39 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
[INFO|configuration_utils.py:654] 2022-12-21 20:37:40,050 >> loading configuration file config.json from cache at /store/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file vocab.json from cache at .cache/datasets/processed/openwebtext/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/vocab.json
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file merges.txt from cache at .cache/datasets/processed/openwebtext/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/merges.txt
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file tokenizer.json from cache at .cache/datasets/processed/openwebtext/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/tokenizer.json
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1799] 2022-12-21 20:37:41,351 >> loading file tokenizer_config.json from cache at None
[INFO|configuration_utils.py:654] 2022-12-21 20:37:41,351 >> loading configuration file config.json from cache at .cache/datasets/processed/openwebtext/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
[INFO|modeling_utils.py:2204] 2022-12-21 20:37:44,384 >> loading weights file pytorch_model.bin from cache at /store/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/pytorch_model.bin
[INFO|modeling_utils.py:2708] 2022-12-21 20:37:50,669 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
======================================================
[INFO|modeling_utils.py:2716] 2022-12-21 20:37:50,669 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
======================================================
december-ds-2h4b6-5hpkj:4041:4041 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
december-ds-2h4b6-5hpkj:4042:4042 [1] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4044:4044 [3] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4043:4043 [2] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4045:4045 [4] NCCL INFO Bootstrap : Using eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4042:4042 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
======================================================
december-ds-2h4b6-5hpkj:4042:4042 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
december-ds-2h4b6-5hpkj:4044:4044 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
december-ds-2h4b6-5hpkj:4042:4042 [1] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4042:4042 [1] NCCL INFO Using network Socket
december-ds-2h4b6-5hpkj:4043:4043 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
december-ds-2h4b6-5hpkj:4045:4045 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
======================================================
december-ds-2h4b6-5hpkj:4044:4044 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
======================================================
december-ds-2h4b6-5hpkj:4045:4045 [4] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
======================================================
december-ds-2h4b6-5hpkj:4043:4043 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
december-ds-2h4b6-5hpkj:4043:4043 [2] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4043:4043 [2] NCCL INFO Using network Socket
december-ds-2h4b6-5hpkj:4044:4044 [3] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4045:4045 [4] NCCL INFO NET/Socket : Using [0]eth0:10.42.92.136<0>
december-ds-2h4b6-5hpkj:4044:4044 [3] NCCL INFO Using network Socket
december-ds-2h4b6-5hpkj:4045:4045 [4] NCCL INFO Using network Socket
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Setting affinity for GPU 4 to ffff,fff00000,00ffffff,f0000000
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Trees [0] -1/-1/-1->4->3 [1] -1/-1/-1->4->3
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Setting affinity for GPU 5 to ffff,fff00000,00ffffff,f0000000
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Setting affinity for GPU 1 to 0fffff,ff000000,0fffffff
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Setting affinity for GPU 2 to 0fffff,ff000000,0fffffff
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Setting affinity for GPU 3 to 0fffff,ff000000,0fffffff
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Channel 00 : 2[3e000] → 3[88000] via direct shared memory
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Channel 00 : 4[89000] → 0[1b000] via direct shared memory
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Channel 01 : 2[3e000] → 3[88000] via direct shared memory
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Channel 01 : 4[89000] → 0[1b000] via direct shared memory
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Channel 00 : 0[1b000] → 1[3d000] via direct shared memory
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Channel 00 : 1[3d000] → 2[3e000] via P2P/IPC
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Channel 01 : 0[1b000] → 1[3d000] via direct shared memory
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Channel 00 : 3[88000] → 4[89000] via P2P/IPC
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Channel 01 : 1[3d000] → 2[3e000] via P2P/IPC
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Channel 01 : 3[88000] → 4[89000] via P2P/IPC
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Channel 00 : 4[89000] → 3[88000] via P2P/IPC
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Connected all rings
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Channel 01 : 4[89000] → 3[88000] via P2P/IPC
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Channel 00 : 1[3d000] → 0[1b000] via direct shared memory
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Channel 01 : 1[3d000] → 0[1b000] via direct shared memory
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Channel 00 : 3[88000] → 2[3e000] via direct shared memory
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Channel 01 : 3[88000] → 2[3e000] via direct shared memory
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Channel 00 : 2[3e000] → 1[3d000] via P2P/IPC
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Channel 01 : 2[3e000] → 1[3d000] via P2P/IPC
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO Connected all trees
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
december-ds-2h4b6-5hpkj:4042:4139 [1] NCCL INFO comm 0x7f97e8002fb0 rank 1 nranks 5 cudaDev 1 busId 3d000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4043:4141 [2] NCCL INFO comm 0x7f5d80002fb0 rank 2 nranks 5 cudaDev 2 busId 3e000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4044:4140 [3] NCCL INFO comm 0x7fee20002fb0 rank 3 nranks 5 cudaDev 3 busId 88000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4041:4138 [0] NCCL INFO comm 0x7f21a4002fb0 rank 0 nranks 5 cudaDev 0 busId 1b000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4045:4142 [4] NCCL INFO comm 0x7f7078002fb0 rank 4 nranks 5 cudaDev 4 busId 89000 - Init COMPLETE
december-ds-2h4b6-5hpkj:4041:4041 [0] NCCL INFO Launch mode Parallel
12/21/2022 20:56:29 - INFO - main - *** Evaluate ***
[INFO|trainer.py:703] 2022-12-21 20:56:29,129 >> The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `GPT2LMHeadModel.forward`, you can safely ignore this message.
[INFO|trainer.py:2944] 2022-12-21 20:56:29,133 >> ***** Running Evaluation *****
[INFO|trainer.py:2946] 2022-12-21 20:56:29,133 >> Num examples = 88340
[INFO|trainer.py:2949] 2022-12-21 20:56:29,133 >> Batch size = 8
0%| | 0/2209 [00:00<?, ?it/s]
0%| | 2/2209 [00:00<18:23, 2.00it/s]
0%| | 3/2209 [00:02<26:56, 1.36it/s]
0%| | 4/2209 [00:03<31:30, 1.17it/s]
…
…
100%|██████████| 2209/2209 [39:03<00:00, 1.06s/it]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4041 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4042 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4043 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4045 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 4044) of binary: /venv/bin/python3
Traceback (most recent call last):
File "/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/venv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./code/gpt2/Model-Compression-Research-Package/examples/transformers/language-modeling/run_clm.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-12-21_21:57:29
host : december-ds-2h4b6-5hpkj
rank : 3 (local_rank: 3)
exitcode : -9 (pid: 4044)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 4044
============================================================
Versions:
According to running `pip freeze --local`:
torch==1.10.2+cu113
torchaudio==0.10.2+cu113
torchvision==0.11.3+cu113
nvidia-ml-py==11.495.46
multiprocess==0.70.14
packaging==21.3
transformers==4.25.1
huggingface-hub==0.11.1
absl-py==1.0.0
accelerate==0.15.0
aiohttp==3.8.3
aiosignal==1.3.1
async-timeout==4.0.2
attrs==22.1.0
blessed==1.19.1
cachetools==5.0.0
certifi==2021.10.8
charset-normalizer==2.0.12
click==8.1.3
datasets==2.7.1
deepspeed==0.6.0
dill==0.3.6
docker-pycreds==0.4.0
evaluate==0.4.0
fairscale==0.4.6
filelock==3.8.2
frozenlist==1.3.3
fsspec==2022.11.0
gitdb==4.0.10
GitPython==3.1.29
google-auth==2.6.0
google-auth-oauthlib==0.4.6
gpustat==1.0.0
grpcio==1.44.0
hjson==3.0.2
idna==3.3
importlib-metadata==4.11.2
joblib==1.2.0
Markdown==3.3.6
model-compression-research @ file:///
multidict==6.0.3
ninja==1.10.2.3
nltk==3.8
numpy==1.22.3
oauthlib==3.2.0
pandas==1.5.2
pathtools==0.1.2
Pillow==9.0.1
pkg_resources==0.0.0
promise==2.3
protobuf==3.19.4
psutil==5.9.0
py-cpuinfo==8.0.0
pyarrow==10.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==3.0.7
python-dateutil==2.8.2
pytz==2022.7
PyYAML==6.0
regex==2022.10.31
requests==2.27.1
requests-oauthlib==1.3.1
responses==0.18.0
rsa==4.8
scikit-learn==1.2.0
scipy==1.9.3
sentencepiece==0.1.97
sentry-sdk==1.12.0
setproctitle==1.3.2
shortuuid==1.0.11
six==1.16.0
sklearn==0.0.post1
smmap==5.0.0
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
threadpoolctl==3.1.0
tokenizers==0.13.2
tqdm==4.63.0
typing_extensions==4.1.1
urllib3==1.26.13
wandb==0.13.7
wcwidth==0.2.5
Werkzeug==2.0.3
xxhash==3.1.0
yarl==1.8.2
zipp==3.7.0
Notes:
- The error occurs both in training and in evaluation.
- To rule out a timeout, I deliberately set a very high timeout value (--ddp_timeout 324000).
- I tried running with both torchrun and torch.distributed.launch and hit the same issue (a sketch of the torch.distributed.launch variant is included after these notes).
- The number of samples in my training/eval set doesn't affect it; the issue remains.
- I track my memory usage, and OOM is not the case here (I kind of wish it were).
- The error occurs only in the distributed setup. When not using distributed training, or when using it with a single GPU, the problem doesn't appear.
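For reference, the torch.distributed.launch variant mentioned in the notes looked roughly like this (a sketch using the same variables as above, not the exact command I ran):

```bash
# Same launch via the older torch.distributed.launch entry point (torch 1.10);
# --use_env makes run_clm.py read LOCAL_RANK from the environment instead of
# receiving a --local_rank argument.
python -m torch.distributed.launch \
    --nnodes=1 \
    --nproc_per_node=${NUM_GPU} \
    --use_env \
    run_clm.py \
    --model_name_or_path ${MODEL} \
    --dataset_name ${DS_NAME} \
    --do_eval \
    --per_device_eval_batch_size ${EVAL_BATCH} \
    --output_dir ${OUTPUT_DIR} \
    --overwrite_output_dir \
    --ddp_timeout 324000
```

Either way, the run ends with the same SIGKILL on one of the ranks.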
I would really appreciate any help on this, since I'm stuck on my research until I figure it out.