Error waiting on exit barrier

Hello

I am using distributed PyTorch. The environment is a Singularity container with NCCL 2.9.9, running on InfiniBand nodes of an HPC cluster managed by Slurm. I normally run on 2 nodes with 1 GPU each, or 2 nodes with 4 GPUs each. The code is YOLOv6 from GitHub.

torch 1.12
torchvision 0.13

I initialize the process group like this:
dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", rank=args.rank, world_size=args.world_size, timeout = timedelta(seconds=7200))
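For reference, rank and world size come from the torchrun environment; this is roughly how they get filled in (a simplified sketch, not the exact YOLOv6 code, assuming the standard RANK / WORLD_SIZE / LOCAL_RANK variables that torchrun exports per worker):

```python
# Simplified sketch (not the exact YOLOv6 code): deriving rank / world size
# from the environment variables that torchrun sets for each worker process.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))   # per-node process index
rank = int(os.environ.get("RANK", 0))                # global rank across nodes
world_size = int(os.environ.get("WORLD_SIZE", 1))    # total number of processes

torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl" if dist.is_nccl_available() else "gloo",
    rank=rank,
    world_size=world_size,
    timeout=timedelta(seconds=7200),
)
```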

The process works with a small dataset, but when the dataset is big, this happens:

Inferencing model in train datasets.:  88%|▉| 1092/1241 [05:07<00:41,  3.62it/s]
Inferencing model in train datasets.:  88%|▉| 1093/1241 [05:07<00:40,  3.62it/s]
Inferencing model in train datasets.:  88%|▉| 1094/1241 [05:07<00:40,  **3.62iERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 304.7670805454254 seconds**
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 906, in _exit_barrier
    store_util.barrier(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:580 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd8 (0x7fff6e2f2428 in /opt/conda/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xcc (0x7fff6e2ecd8c in /opt/conda/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a0 (0x7fff9f641880 in /opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4c (0x7fff9f642cac in /opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x90 (0x7fff9f642dc0 in /opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6c (0x7fff9f6052fc in /opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xb29d38 (0x7fffa2249d38 in /opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x25622c (0x7fffa197622c in /opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x2c4960 (0x132f94960 in /opt/conda/bin/python)
frame #9: _PyObject_MakeTpCall + 0xcc (0x132d4e26c in /opt/conda/bin/python)
frame #10: <unknown function> + 0x2a4bb4 (0x132f74bb4 in /opt/conda/bin/python)
frame #11: _PyEval_EvalFrameDefault + 0x8418 (0x132d36038 in /opt/conda/bin/python)
frame #12: <unknown function> + 0x5c5f4 (0x132d2c5f4 in /opt/conda/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x7104 (0x132d34d24 in /opt/conda/bin/python)
frame #14: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #15: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #16: _PyEval_EvalFrameDefault + 0x7104 (0x132d34d24 in /opt/conda/bin/python)
frame #17: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #18: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x6f40 (0x132d34b60 in /opt/conda/bin/python)
frame #20: <unknown function> + 0x5c5f4 (0x132d2c5f4 in /opt/conda/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x7e18 (0x132d35a38 in /opt/conda/bin/python)
frame #22: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #23: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x7e18 (0x132d35a38 in /opt/conda/bin/python)
frame #25: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #26: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #27: PyVectorcall_Call + 0x90 (0x132d4d8a0 in /opt/conda/bin/python)
frame #28: _PyObject_Call + 0x1a8 (0x132d4dbe8 in /opt/conda/bin/python)
frame #29: PyCFunction_Call + 0x44 (0x132d4dce4 in /opt/conda/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x389c (0x132d314bc in /opt/conda/bin/python)
frame #31: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #32: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x7e18 (0x132d35a38 in /opt/conda/bin/python)
frame #34: <unknown function> + 0x5c5f4 (0x132d2c5f4 in /opt/conda/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x7104 (0x132d34d24 in /opt/conda/bin/python)
frame #36: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #37: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #38: _PyObject_FastCallDictTstate + 0x8c (0x132d4e55c in /opt/conda/bin/python)
frame #39: _PyObject_Call_Prepend + 0xd4 (0x132d4e914 in /opt/conda/bin/python)
frame #40: <unknown function> + 0xef654 (0x132dbf654 in /opt/conda/bin/python)
frame #41: _PyObject_Call + 0x94 (0x132d4dad4 in /opt/conda/bin/python)
frame #42: PyCFunction_Call + 0x44 (0x132d4dce4 in /opt/conda/bin/python)
frame #43: _PyEval_EvalFrameDefault + 0x389c (0x132d314bc in /opt/conda/bin/python)
frame #44: <unknown function> + 0x5c5f4 (0x132d2c5f4 in /opt/conda/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x7104 (0x132d34d24 in /opt/conda/bin/python)
frame #46: <unknown function> + 0x5c5f4 (0x132d2c5f4 in /opt/conda/bin/python)
frame #47: PyVectorcall_Call + 0x90 (0x132d4d8a0 in /opt/conda/bin/python)
frame #48: _PyObject_Call + 0x1a8 (0x132d4dbe8 in /opt/conda/bin/python)
frame #49: PyCFunction_Call + 0x44 (0x132d4dce4 in /opt/conda/bin/python)
frame #50: _PyEval_EvalFrameDefault + 0x389c (0x132d314bc in /opt/conda/bin/python)
frame #51: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #52: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x7104 (0x132d34d24 in /opt/conda/bin/python)
frame #54: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #55: _PyEval_EvalCodeWithName + 0xa4 (0x132e306a4 in /opt/conda/bin/python)
frame #56: PyEval_EvalCodeEx + 0x74 (0x132e30744 in /opt/conda/bin/python)
frame #57: PyEval_EvalCode + 0x48 (0x132e307c8 in /opt/conda/bin/python)
frame #58: <unknown function> + 0x1b0cbc (0x132e80cbc in /opt/conda/bin/python)
frame #59: <unknown function> + 0x1b0e30 (0x132e80e30 in /opt/conda/bin/python)
frame #60: PyRun_FileExFlags + 0x114 (0x132e85b84 in /opt/conda/bin/python)
frame #61: PyRun_SimpleFileExFlags + 0x230 (0x132e85f20 in /opt/conda/bin/python)
frame #62: PyRun_AnyFileExFlags + 0xc8 (0x132e86918 in /opt/conda/bin/python)
frame #63: Py_RunMain + 0xa00 (0x132d3bc60 in /opt/conda/bin/python)

t/s]
Inferencing model in train datasets.:  88%|▉| 1095/1241 [05:07<00:40,  3.62it/s]
Inferencing model in train datasets.:  88%|▉| 1096/1241 [05:08<00:40,  3.62it/s]
Inferencing model in train datasets.:  88%|▉| 1097/1241 [05:08<00:39,  3.63it/s]
Inferencing model in train datasets.:  88%|▉| 1098/1241 [05:08<00:39,  3.62it/s]
Inferencing model in train datasets.:  89%|▉| 1099/1241 [05:09<00:39,  3.63it/s]

So, what is happening? After 2 epochs the evaler (the YOLOv6 evaluator) runs, and I guess this call comes from the evaler when it tries to save the model or checkpoint. Ironically, all the prints from the main training process still appear, so even after this error the process does not stop and runs to the end. Even the inference progress bar ("Inferencing model in train datasets") finishes.
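In case it matters, this is the kind of pattern I am wondering about (a hypothetical sketch, not the actual YOLOv6 code; `evaler` and `ckpt_path` are placeholders): if only rank 0 runs the slow evaluation and checkpoint saving, the other ranks reach the end of the script much earlier unless something makes them wait.

```python
# Hypothetical sketch (not the actual YOLOv6 code) of the pattern I suspect:
# rank 0 does the slow evaluation / checkpoint saving while the other ranks
# have nothing left to do.
import torch
import torch.distributed as dist

def end_of_epoch(rank, model, evaler, ckpt_path):
    if rank == 0:
        evaler.run(model)                          # slow with a big dataset
        torch.save(model.state_dict(), ckpt_path)  # checkpoint only on rank 0
    # Without a barrier here, the non-zero ranks finish the script while rank 0
    # is still evaluating, and their agents start waiting on the exit barrier.
    dist.barrier()
```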

This is the Slurm script:

#!/bin/bash
#SBATCH --job-name=yolo_test
#SBATCH --qos=debug
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=160
#SBATCH --gres=gpu:4
#SBATCH --output=/gpfs/scratch/X/X/slurm_logs/airurban/yolo_test_%j.out
#SBATCH --error=/gpfs/scratch/X/X/slurm_logs/airurban/yolo_test_%j.err


#export NCCL_DEBUG_SUBSYS=COLL
#export LOGLEVEL=INFO
export NCCL_DEBUG=INFO
export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_SHOW_CPP_STACKTRACES=1
export NCCL_IB_TIMEOUT=22

# ---> I'VE BEEN USING THESE VARIABLES A LOT, WITH THE SAME RESULTS
export NCCL_ASYNC_ERROR_HANDLING=1
#export NCCL_DESYNC_DEBUG=1
#export ENABLE_NCCL_HEALTH_CHECK=1

echo "SLURM_JOB_NODELIST="$SLURM_JOB_NODELIST
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
echo Node IP: $head_node_ip

export MASTER_ADDR=$head_node_ip
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_GPUS_ON_NODE))

echo "MASTER_ADDR="$MASTER_ADDR
echo "Head Node IP:="$head_node_ip
echo "SLURM_PROCID="$SLURM_PROCID
echo "SLURM_NNODES="$SLURM_NNODES
echo "SLURM_JOB_ID="$SLURM_JOB_ID
echo "SLURM_GPUS_ON_NODE="$SLURM_GPUS_ON_NODE
echo "WORLD_SIZE="$WORLD_SIZE


### Loading environment
module load cuda/10.2 cudnn/8.0.5 nccl/2.9.9 singularity
srun singularity exec --nv /apps/SINGULARITY/images/numpy1.26-torch-vision.sif torchrun \
		--nproc_per_node $SLURM_GPUS_ON_NODE \
		--nnodes $SLURM_NNODES \
		--rdzv_id $SLURM_JOB_ID \
		--rdzv_backend c10d \
		--rdzv_endpoint $MASTER_ADDR:29500 \
		--log_dir "/gpfs/scratch/X/X/slurm_logs/airurban/torchrun_logs" \
		 tools/train.py \
			--batch 32 \
			--bs_per_gpu=8 \
			--conf configs/yolov6n.py \
			--data data/dataset.yaml \
			--fuse_ab \
			--device 0,1,2,3 \
			--workers 32 \
			--eval-interval 1 \
			--epochs 2 

I’ve been reading a lot about this in forums and topics here, and according to them the timeout is pretty important. My question is: why is the 7200 s not being taken into account? When the dataset is not huge, the process works fine all the way to the end, without errors. When it is huge, something gets delayed at some point, but the main process is not waiting for its children, is it? So why is increasing the timeout not working? The 7200 s value does seem to be picked up; if I set the log level to maximum, I see:


[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (10.2.1.39, 29500).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:29500.
[I socket.cpp:710] [c10d - trace] The server socket on [p9r3n03]:29500 is not yet listening (errno: 111 - Connection refused), will retry.
[I socket.cpp:417] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:462] [c10d - debug] The server socket is attempting to listen on [::]:29500.
[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (10.2.1.39, 29500).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:29500.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:29500 on [p9r3n03]:48738.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (10.2.1.39, 29500).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:29500.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:29500 on [p9r3n03]:48740.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [p9r3n03]:48738.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [p9r3n03]:48740.
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:29500.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [p9r3n14]:42492.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:29500 on [p9r3n14]:42492.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (10.2.1.39, 29500).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:29500.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [p9r3n14]:42494.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:29500 on [p9r3n14]:42494.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
Using 4 GPU for training... 
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:710] [c10d - trace] The server socket on [p9r3n03]:54679 is not yet listening (errno: 111 - Connection refused), will retry.
[I socket.cpp:710] [c10d - trace] The server socket on [p9r3n03]:54679 is not yet listening (errno: 111 - Connection refused), will retry.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:710] [c10d - trace] The server socket on [p9r3n03]:54679 is not yet listening (errno: 111 - Connection refused), will retry.
training args are: Namespace(data_path='data/dataset.yaml', conf_file='configs/yolov6n.py', img_size=640, batch_size=32, epochs=2, workers=32, device='0,1,2,3', eval_interval=1, eval_final_only=False, heavy_eval_range=50, check_images=False, check_labels=False, output_dir='./runs/train', name='exp', dist_url='env://', gpu_count=0, local_rank=0, resume=False, write_trainbatch_tb=False, stop_aug_last_n_epoch=15, save_ckpt_on_last_n_epoch=-1, distill=False, distill_feat=False, quant=False, calib=False, teacher_model_path=None, temperature=20, fuse_ab=True, bs_per_gpu=8, rank=0, world_size=8, save_dir='runs/train/exp26')

Initializing process group... 
[I socket.cpp:417] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:462] [c10d - debug] The server socket is attempting to listen on [::]:54679.
[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:54679.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n03]:55806.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n03]:55806.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51710.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51714.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51716.
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51712.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51718.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n14]:51710.
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51722.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n14]:51714.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n14]:51716.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n14]:51718.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n14]:51722.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51720.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n14]:51712.
[I ProcessGroupNCCL.cpp:587] [Rank 7] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 7200000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 7] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 5] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 7200000
USE_HIGH_PRIORITY_STREAM: 0

Why does this keep failing at ~300 seconds? I’ve read that by default the init timeout is 30 minutes. So, are there two different timeouts? My process probably also takes around 30 minutes at that point, but the error blames the 300-second timeout.
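If I understand the traceback correctly, the exit barrier is just a key exchange over the rendezvous TCPStore, with its own timeout that is separate from the 7200 s I pass to init_process_group. Below is a minimal, hypothetical sketch of that pattern (host, port, and key names are made up): a get() on a key that no other process ever sets fails with the same "Socket Timeout" once the store timeout expires.

```python
# Minimal, hypothetical sketch of the pattern in the traceback: a TCPStore.get()
# on a key that nobody sets times out after the *store* timeout, independently
# of the timeout given to init_process_group.
from datetime import timedelta
import torch.distributed as dist

store = dist.TCPStore(
    "127.0.0.1", 29501,                 # made-up host/port, just for the demo
    world_size=1, is_master=True,
    timeout=timedelta(seconds=10),      # short timeout so the demo fails fast
    wait_for_workers=False,
)
store.set("exit_barrier/rank0", "done")  # this process announces it is done
try:
    store.get("exit_barrier/rank1")      # the other rank never sets its key
except RuntimeError as e:
    print("timed out waiting on the store:", e)
```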

I’ve seen other people report what looks like the same issue in other topics here.

Please, any thoughts are welcome. Thanks.

I forgot to mention the CUDA and NCCL versions:
CUDA 10.2
NCCL 2.9.9

Could you update PyTorch to the latest stable or nightly release and check if you would still run into these issues?

Hello
It’s an HPC research facility, so updating is not easy and sometimes not possible; for now, that is the case. However, a new HPC cluster is “opening its doors” soon (weeks, maybe a few months). So, if there is no further advice for tracking down the bug with the current configuration, I’ll post back here in a few weeks with new information. Otherwise, any new advice is welcome.