Hello, I submitted a 4-node task with 1 GPU per node.
But it exited with an exception.
Some of the relevant log output is as follows:
NCCL WARN Connect to 10.38.10.112<21724> failed : Connection refused
The strange thing is that none of the 4 nodes' IPs is 10.38.10.112. I don't know why NCCL tries to connect to that IP and port.
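For reference, this is a minimal sketch (assuming psutil is installed) of how I would list every IPv4 address on a node, to check whether that address belongs to some virtual interface (docker0, etc.) that NCCL might have picked for its bootstrap socket:

```python
# Hypothetical check: print every interface/IPv4 pair on this node.
# Run on each of the 4 nodes and look for 10.38.10.112.
import socket

import psutil  # assumption: psutil is available in the environment

for ifname, addrs in psutil.net_if_addrs().items():
    for addr in addrs:
        if addr.family == socket.AF_INET:
            print(f"{ifname}: {addr.address}")
```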
Besides, I have set NCCL_SOCKET_IFNAME to "^lo,docker":
self.dist_backend: nccl
self.dist_init_method: file:///home/storage15/huangying/tools/espnet/egs2/voxforge/asr1/vox.init
self.dist_world_size: 4
self.dist_rank: 1
auto allocate gpu device: 0
devices ids is 0
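To isolate this from ESPnet, below is a minimal sketch of the same distributed setup (same nccl backend, same file:// rendezvous, same NCCL_SOCKET_IFNAME exclusion). The RANK environment variable is an assumption here; it would be supplied per node by the launcher:

```python
# Minimal repro sketch of the distributed setup above, outside ESPnet.
# Run one copy per node with RANK set to 0..3.
import os

import torch
import torch.distributed as dist

os.environ["NCCL_SOCKET_IFNAME"] = "^lo,docker"  # same exclusion list as above

dist.init_process_group(
    backend="nccl",
    init_method="file:///home/storage15/huangying/tools/espnet/egs2/voxforge/asr1/vox.init",
    world_size=4,
    rank=int(os.environ["RANK"]),  # assumption: rank passed via environment
)

# A single all_reduce exercises the same NCCL bootstrap that fails above.
x = torch.ones(1).cuda(0)
dist.all_reduce(x)
print("rank", dist.get_rank(), "ok:", x.item())
```

If this sketch fails with the same "Connection refused" to the unknown IP, the problem is in the NCCL bootstrap/network setup rather than in ESPnet itself.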
----------------------------------------------------details-------------------------------------------------------------
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO Call to connect returned Connection refused, retrying
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO Call to connect returned Connection refused, retrying
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO Call to connect returned Connection refused, retrying
tj1-asr-train-v100-13:941227:943077 [0] include/socket.h:390 NCCL WARN Connect to 10.38.10.112<21724> failed : Connection refused
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO bootstrap.cc:100 -> 2
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO bootstrap.cc:326 -> 2
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO init.cc:695 -> 2
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO init.cc:951 -> 2
tj1-asr-train-v100-13:941227:943077 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
/home/storage15/huangying/tools/anaconda3/envs/py36/lib/python3.6/site-packages/librosa/util/decorators.py:9: NumbaDeprecationWarning: An import was requested from a module that has moved location.
Import of 'jit' requested from: 'numba.decorators', please update to use 'numba.core.decorators' or pin to Numba version 0.48.0. This alias will not be present in Numba version 0.50.0.
from numba.decorators import jit as optional_jit
/home/storage15/huangying/tools/anaconda3/envs/py36/bin/python3 /home/storage15/huangying/tools/espnet/espnet2/bin/asr_train.py --use_preprocessor true --bpemodel none --token_type char --token_list data/token_list/char/tokens.txt --non_linguistic_symbols none --train_data_path_and_name_and_type dump/fbank_pitch/tr_en/feats.scp,speech,kaldi_ark --train_data_path_and_name_and_type dump/fbank_pitch/tr_en/text,text,text --valid_data_path_and_name_and_type dump/fbank_pitch/dt_en/feats.scp,speech,kaldi_ark --valid_data_path_and_name_and_type dump/fbank_pitch/dt_en/text,text,text --train_shape_file exp/asr_stats/train/speech_shape --train_shape_file exp/asr_stats/train/text_shape.char --valid_shape_file exp/asr_stats/valid/speech_shape --valid_shape_file exp/asr_stats/valid/text_shape.char --resume true --fold_length 800 --fold_length 150 --output_dir exp/asr_train_asr_transformer_fbank_pitch_char_normalize_confnorm_varsFalse --ngpu 1 --dist_init_method file:///home/storage15/huangying/tools/espnet/egs2/voxforge/asr1/vox.init --multiprocessing_distributed false --dist_launcher queue.pl --dist_world_size 4 --config conf/train_asr_transformer.yaml --input_size=83 --normalize=global_mvn --normalize_conf stats_file=exp/asr_stats/train/feats_stats.npz --normalize_conf norm_vars=False
Traceback (most recent call last):
File "/home/storage15/huangying/tools/anaconda3/envs/py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/storage15/huangying/tools/anaconda3/envs/py36/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/storage15/huangying/tools/espnet/espnet2/bin/asr_train.py", line 23, in <module>
main()
File "/home/storage15/huangying/tools/espnet/espnet2/bin/asr_train.py", line 19, in main
ASRTask.main(cmd=cmd)
File "/home/storage15/huangying/tools/espnet/espnet2/tasks/abs_task.py", line 842, in main
cls.main_worker(args)
File "/home/storage15/huangying/tools/espnet/espnet2/tasks/abs_task.py", line 1174, in main_worker
distributed_option=distributed_option,
File "/home/storage15/huangying/tools/espnet/espnet2/train/trainer.py", line 163, in run
else None
File "/home/storage15/huangying/tools/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 303, in __init__
self.broadcast_bucket_size)
File "/home/storage15/huangying/tools/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 485, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:410, unhandled system error, NCCL version 2.4.8
Setting NCCL_SOCKET_IFNAME
Finished NCCL_SOCKET_IFNAME
^lo,docker
self.dist_backend: nccl
self.dist_init_method: file:///home/storage15/huangying/tools/espnet/egs2/voxforge/asr1/vox.init
self.dist_world_size: 4
self.dist_rank: 1
auto allocate gpu device: 0
devices ids is 0
# Accounting: time=107 threads=1
# Finished at Mon May 18 14:45:28 CST 2020 with status 1