Unable to execute example.py in examples/distributed/ddp across nodes?

nsriniva03 · March 11, 2022, 7:09pm

Hello,

I am trying to test out distributed training across nodes using the example.py script provided in examples/distributed/ddp.

The code does not get past the dist.init_process_group(backend="nccl") call; it just hangs there.

I used the following commands to run the code.

On Node 1:
export NCCL_DEBUG=INFO ; export NCCL_DEBUG_SUBSYS=ALL; python /home/nisha/anaconda3/envs/pytorch1.8_env/lib/python3.8/site-packages/torch/distributed/launch.py --nnode=2 --node_rank=0 --nproc_per_node=8 --master_addr=162.157.89.143 --master_port=20051 example.py --local_world_size=8

Output Node 1:

[17182] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '2', 'WORLD_SIZE': '16'}
[17180] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '0', 'WORLD_SIZE': '16'}
[17183] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '3', 'WORLD_SIZE': '16'}
[17184] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '4', 'WORLD_SIZE': '16'}
[17186] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '6', 'WORLD_SIZE': '16'}
[17187] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '7', 'WORLD_SIZE': '16'}
[17181] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '1', 'WORLD_SIZE': '16'}
[17185] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '5', 'WORLD_SIZE': '16'}

On Node 2:
export NCCL_DEBUG=INFO ; export NCCL_DEBUG_SUBSYS=ALL; python /home/nisha/anaconda3/envs/pytorch1.8_env/lib/python3.8/site-packages/torch/distributed/launch.py --nnode=2 --node_rank=1 --nproc_per_node=8 --master_addr=162.157.89.143 --master_port=20051 example.py --local_world_size=8

Output on node 2:

[36854] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '15', 'WORLD_SIZE': '16'}
[36851] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '12', 'WORLD_SIZE': '16'}
[36847] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '8', 'WORLD_SIZE': '16'}
[36848] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '9', 'WORLD_SIZE': '16'}
[36853] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '14', 'WORLD_SIZE': '16'}
[36852] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '13', 'WORLD_SIZE': '16'}
[36850] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '11', 'WORLD_SIZE': '16'}
[36849] Initializing process group with: {'MASTER_ADDR': '162.157.89.143', 'MASTER_PORT': '20051', 'RANK': '10', 'WORLD_SIZE': '16'}
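
As a sanity check on the two outputs, the ranks look consistent with how the launcher assigns them: global rank = node_rank * nproc_per_node + local rank, and world size = nnodes * nproc_per_node. A minimal sketch of that arithmetic (the helper name global_ranks is hypothetical, used here only for illustration):

```python
from typing import List


def global_ranks(node_rank: int, nproc_per_node: int) -> List[int]:
    """Global ranks expected on one node (hypothetical helper, not part of torch)."""
    return [node_rank * nproc_per_node + local_rank
            for local_rank in range(nproc_per_node)]


# Values taken from the launch commands in this post.
nnodes, nproc_per_node = 2, 8
world_size = nnodes * nproc_per_node

node0 = global_ranks(0, nproc_per_node)  # ranks expected on Node 1 (node_rank=0)
node1 = global_ranks(1, nproc_per_node)  # ranks expected on Node 2 (node_rank=1)
```

So the rank and world size assignment does not appear to be the problem; every process reaches the rendezvous with the expected values.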

I have tested that the port is open and communicating between the machines by starting the code on Node 1 and running telnet 162.157.89.143 20051 on Node 2.
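
For anyone who wants to reproduce the port check without telnet, something like the following should be roughly equivalent. It is only a sketch using the standard library; the function name can_connect is made up for this post, and it only shows that a TCP connection to the master port can be opened, not that NCCL itself can communicate.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call from Node 2, using the address and port from the launch command:
#   can_connect("162.157.89.143", 20051)
```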

I am not seeing any NCCL output, although I have set NCCL_DEBUG. The program just hangs and nothing else happens.

I also tried attaching gdb to both processes:

gdb -p <pid>
set logging on
thread apply all bt
# Output will be in gdb.txt file

However, gdb.txt was empty.

Another strange behavior: when I launch the script, the processes it starts do not show up in nvidia-smi.

Any suggestions on things I could try?

Thank you