Large Jobs Stuck Waiting in Store Based Barrier DDP

Hi there! I’m using DDP to launch a job across 32 compute nodes, but initialization seems to be failing because not all workers join, even though the script runs successfully on every node. Each task starts up, but only certain ranks actually join during dist.init_process_group, judging by this waiting message:

2022-01-06,00:00:41 | INFO | Rank 15 | Waiting in store based barrier to initialize process group for rank: 15, key: store_based_barrier_key:1 (world_size=128, worker_count=43, timeout=0:30:00)

My current SLURM script looks like this:

#SBATCH --account=cstdl
#SBATCH --nodes=32
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=24
#SBATCH --time=00:05:00
#SBATCH --partition=booster
#SBATCH --job-name=open_clip_debugging

# load low-level libraries
ml purge
eval "$(/p/project/ccstdl/gordon2/miniconda3/bin/conda shell.bash hook)" # init conda
conda activate open_clip
ml use $OTHERSTAGES
ml GCC/10.3.0
ml NCCL/2.10.3-1-CUDA-11.3
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_IB_TIMEOUT = 22
export NCCL_IGNORE_CPU_AFFINITY=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802
export WORLD_SIZE=128

### get the first node name as master address - customized for vgg slurm
### e.g. master(gnodee[2-5],gnoded1) == gnodee2
echo "NODELIST="${SLURM_NODELIST}
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR

cd /p/scratch/ccstdl/gordon2/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"
srun python -u src/training/main.py \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data="/p/scratch/ccstdl/gordon2/CC3M/train/{00000..03318}.tar" \
    --warmup 10000 \
    --batch-size=8 \
    --lr=5e-4 \
    --wd=0.1 \
    --epochs=1 \
    --workers=23 \
    --model ViT-B/32 \
    --name "32nodes" \
    --dist-url="env://"

What other things should I be doing in my script (slurm or python) to ensure that these ranks are properly connecting?

Happy to provide more details if needed :)!

Are you running 4 GPUs on each of 32 compute nodes, resulting in a world_size of 128? If so, what does the --workers=23 argument refer to?

To debug this, can you look at all other ranks to see what they are doing? For example, you should see logs like "Added key: xxx to store for rank: yyy". This would help in debugging why certain ranks might not have written to the store.

Based on the error you shared above, it looks like only 43/128 ranks reported in.

--workers refers to the number of workers for my DataLoader; sorry for the confusion.

Here is the log file for the whole run (it’s a bit of a behemoth): 32-node-fail.out · GitHub

One of the exports is incorrect, so I’m running a job to fix that. Otherwise, some of the print statements I include print the global rank as well as the address each rank initializes to. All seems to be in order there, but they get stuck during process group initialization:

    if args.distributed:
        print(f"Starting rank {args.rank}")
        print(f"Initing to IP {os.environ['MASTER_ADDR']}")
        dist.init_process_group(
            backend=args.dist_backend,
            init_method=args.dist_url,
            world_size=args.world_size,
            rank=args.rank,
        )

If any other decorators or debugging terms would be helpful I’m glad to rerun with those attached!

Update: even with the NCCL_IB_TIMEOUT fixed back at 22 there was no success.

Given that log statement, your reading seems to be correct. I’m unsure how to debug further into the TCPStore initialization that’s going on. Each rank fires up (as can be seen from the logs), but they’re most definitely not all joining in…

Thanks for sharing this file. Looking through it, only 43 workers report “Added key: store_based_barrier_key:1 to store for rank…”, which lines up with only 43/128 ranks reporting in. It seems like all ranks enter init_process_group, but some of them get stuck after that and never reach the store-based barrier logic that updates the store.

Would it be possible to identify one such stuck rank and then attach gdb as follows to see its full stack trace:

gdb -p <pid>
thread apply all bt

Another option is to use the Python faulthandler module (faulthandler — Dump the Python traceback — Python 3.10.1 documentation) to dump the traceback of all threads in a stuck process. dump_traceback_later could be useful here: it dumps a process’s traceback after X seconds of timeout (indicating that it is stuck).
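A minimal sketch of that approach (the timeout value is just an example; the placement around the init call is an assumption, not from the original script):

```python
import faulthandler
import sys

# Arm a delayed dump: if we are still stuck after `timeout` seconds,
# every thread's traceback is written to stderr.
faulthandler.dump_traceback_later(timeout=20 * 60, file=sys.stderr)

# ... dist.init_process_group(...) would run here ...

# Disarm the dump once initialization succeeds.
faulthandler.cancel_dump_traceback_later()
```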

Not sure if I’ll be able to get access to the actual compute nodes, so I tried out faulthandler and have the following output: 32-node-faulthandler.out · GitHub. Not sure if much can be made of this, but thank you for your time.

Happy to try any other methods you need too!

Thanks for sharing the faulthandler output; that definitely helps narrow down where the problem might come from. However, the nodes are stuck in logic that resides in C++, so the Python stack doesn’t cover it. In particular, the following would help in debugging this further:

  1. If possible, get access to the compute node hosting rank 0 and capture a full gdb stack trace as I mentioned above.
  2. When we print Starting rank, let’s also print args.dist_backend, args.dist_url, and args.world_size for that particular rank. I’m especially interested in rank 0 here.
  3. Set torch._C._set_print_stack_traces_on_fatal_signal(True) for rank 0 and see if it produces C++ stack traces. If not, we probably need to kill rank 0 with a fatal signal (e.g. kill -9) to produce a C++ stack trace for that particular rank.
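For point 2, a minimal sketch of the extra logging (the field names follow the snippet shared earlier in the thread; args here is a stand-in for the script’s parsed arguments, and the values are examples only):

```python
from types import SimpleNamespace

# Stand-in for the script's parsed arguments; field names follow the
# earlier snippet, and these values are examples only.
args = SimpleNamespace(rank=0, dist_backend="nccl",
                       dist_url="env://", world_size=128)

print(f"Starting rank {args.rank}: backend={args.dist_backend}, "
      f"url={args.dist_url}, world_size={args.world_size}")
```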

Another thing you could try, if possible, is to manually check TCP connectivity between the rank 0 machine and a machine whose rank is stuck in the TCP rendezvous. For example, check whether the stuck rank can reach MASTER_ADDR:MASTER_PORT on rank 0.
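A rough sketch of that connectivity check, assuming MASTER_ADDR and MASTER_PORT are set in the stuck rank’s environment:

```python
import os
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# On a stuck rank, something like:
# can_reach(os.environ["MASTER_ADDR"], int(os.environ["MASTER_PORT"]))
```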

My colleague Ross Wightman solved this through some network sleuthing. I hope this helps someone in the future!

He writes:

I spent quite a bit of time hacking around and mapping out network interfaces on the worker nodes via logging / prints in the train script.

The main network interface, mapped to the default hostname, is what we were using for PyTorch’s distributed.init_process_group TCPStore initialization. As Cade has reported, as the nodes scaled up, quite a number of them were getting stuck trying to communicate with the master node, and the whole init would fail to move forward.

I installed a PyTorch nightly env (1.11) because a PyTorch engineer said it had some better logging. Slightly better, I suppose, but I was just seeing socket connection timeouts being reported (which means a bunch of nodes were failing to make the connection to master). That could be many things: routing issues, port/firewall blocking, rate limiting, capacity overload.

I started mapping out the other interfaces and figured out the pattern for the IP-over-InfiniBand interfaces. The hostname patterns are as follows:

enp225s0f0 - jwb0000 - the machine hostname, 2-port ethernet nic?
ib0 - jwb0000i - infiniband
ib1 - jwb0000iu - infiniband
ib2 - jwb0000iv - infiniband
ib3 - jwb0000iw - infiniband
… there are other ib interfaces as well, but I’m skipping them

I assume the enp225 interface is for control/storage. I imagine it’s a 10GbE+ adapter (didn’t check yet), so it should be plenty capable. Perhaps there are extra firewall rules, port-range restrictions, or routing on there that impact the TCPStore handshake (any node being able to establish a TCP connection on a specified port)?

I’m not sure whether all InfiniBand adapters can be treated equally; I assume they are all ‘cluster interconnect’ for MPI / NCCL / etc.

I got past the startup hang issues by using the hostname for ib0 instead of the default machine hostname… basically, append an ‘i’ to the hostname for MASTER_ADDR in the launch script. The nodes do start a bit staggered at that node count, but everything appears to work: early nodes log that they are waiting, the laggards come online, and then off to the races!

It’d be nice to have some more details on the network setup / intentions. And also what restrictions are with regards to port ranges that should be open and free to use, etc.

Also, to test this out, make sure you dial back the debugging env vars, especially the TORCH_DISTRIBUTED_* vars. They force the creation of a secondary GLOO group that hangs because it tries to use the wrong interface (it falls back to the main interface because it looks up the IP itself based on socket.gethostname(), which is not a good idea on a large-scale network).

Ross
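To illustrate the socket.gethostname() lookup Ross mentions, a small sketch that prints what the default hostname resolves to on a node (this is the address the secondary GLOO group ends up using):

```python
import socket

# Resolving your own default hostname picks the interface mapped to that
# hostname, which may not be the interface you intended for rendezvous.
host = socket.gethostname()
try:
    ip = socket.gethostbyname(host)
except socket.gaierror:
    ip = "unresolvable"
print(f"default hostname {host} resolves to {ip}")
```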

TLDR: Ensure your $MASTER_ADDR refers to an address capable of handling all these connections; in our case that meant pointing it at InfiniBand by essentially just appending an “i” to the $MASTER_ADDR.
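In code, the fix amounts to something like the following sketch, applied before init_process_group. Note that both the example hostname and the “i” suffix convention are specific to this cluster:

```python
import os

# Site-specific sketch: on this cluster, the ib0 hostname is the machine
# hostname with an "i" appended, so rewrite MASTER_ADDR accordingly.
# "jwb0000" is an example value for illustration.
os.environ.setdefault("MASTER_ADDR", "jwb0000")
addr = os.environ["MASTER_ADDR"]
if not addr.endswith("i"):
    os.environ["MASTER_ADDR"] = addr + "i"
print("MASTER_ADDR =", os.environ["MASTER_ADDR"])
```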
