Thanks a lot @mrshenli for these responses! Looks like this should be in an official guide.
@mrshenli I’m having trouble launching the most basic two-node distributed configuration (I checked; the TCP connection with nc works fine). It doesn’t seem to respect the passed port. If you could take a look, that would be awesome!
UPD: I created an issue to discuss this: https://github.com/pytorch/pytorch/issues/44544
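For reference, the nc check only proves that MASTER_PORT itself is reachable. A quick equivalent check in Python (address and port taken from the runs below):

import socket

# Run from the non-master node; this verifies only the rendezvous port.
socket.create_connection(("10.81.13.54", 12345), timeout=5)
print("master rendezvous port reachable")

The test script: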
import os
import argparse

import torch
import torch.distributed as dist

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--backend', default='gloo')
    parser.add_argument('--rank', type=int, default=0)
    parser.add_argument('--world-size', type=int, default=1)
    args = parser.parse_args()

    # Rendezvous via MASTER_ADDR/MASTER_PORT taken from the environment.
    dist.init_process_group(args.backend, init_method="env://", rank=args.rank, world_size=args.world_size)
    print(f"Master node {os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}. Rank {args.rank}. World size: {args.world_size}")

    # Each rank contributes rank + 1, so after SUM every rank should hold 1 + 2 + ... + world_size.
    test_tensor = torch.tensor(args.rank + 1)
    if args.backend == 'nccl':
        test_tensor = test_tensor.cuda()
    dist.all_reduce(test_tensor, op=dist.ReduceOp.SUM)
    print(f"Test value: {test_tensor.item()}, expected: {sum(range(args.world_size + 1))}")
Test standalone
Node 1
Input:
GLOO_SOCKET_IFNAME=team0 MASTER_ADDR=10.81.13.54 MASTER_PORT=12345 python distributed_example.py
Output:
Master node 10.81.13.54:12345. Rank 0. World size: 1
Test value: 1, expected: 1
Node 2
Input:
GLOO_SOCKET_IFNAME=team0 MASTER_ADDR=10.81.13.51 MASTER_PORT=12345 python distributed_example.py
Output:
Master node 10.81.13.51:12345. Rank 0. World size: 1
Test value: 1, expected: 1
Test distributed Gloo
Node 1
Input:
GLOO_SOCKET_IFNAME=team0 MASTER_ADDR=10.81.13.54 MASTER_PORT=12345 python distributed_example.py --rank 0 --world-size 2
Output:
Traceback (most recent call last):
  File "distributed_example.py", line 11, in <module>
    dist.init_process_group('gloo', init_method="env://", rank=args.rank, world_size=args.world_size)
  File "/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 425, in init_process_group
    _default_pg = _new_process_group_helper(
  File "/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 499, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: [/opt/conda/conda-bld/pytorch_1595629411241/work/third_party/gloo/gloo/transport/tcp/pair.cc:769] connect [10.81.13.51]:11169: No route to host
Node 2
Input:
GLOO_SOCKET_IFNAME=team0 MASTER_ADDR=10.81.13.54 MASTER_PORT=12345 python distributed_example.py --rank 1 --world-size 2
Output: (none; the process hangs)
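Note the port in the Gloo error above: 11169, not the MASTER_PORT of 12345. Gloo only uses MASTER_ADDR:MASTER_PORT for the initial rendezvous; the actual pairwise connections go over OS-assigned ephemeral ports on each node, so a firewall that only opens 12345 fails exactly like this. A minimal sketch with plain sockets (no PyTorch) that reproduces the pattern, assuming the listener runs on 10.81.13.51:

import socket

# Mimic Gloo: listen on an OS-assigned ephemeral port (bind to port 0).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("", 0))
port = srv.getsockname()[1]
print(f"listening on ephemeral port {port}")
srv.listen(1)
conn, addr = srv.accept()  # blocks until the peer connects
print(f"connected from {addr}")
conn.close()
srv.close()

From the other node, socket.create_connection(("10.81.13.51", port)) should succeed; if it raises "No route to host", the firewall between the nodes is dropping traffic to ephemeral ports, which is what the Gloo traceback above is hitting.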
Test distributed NCCL
Node 1
Input:
NCCL_SOCKET_IFNAME=team0 NCCL_DEBUG=INFO MASTER_ADDR=10.81.13.54 MASTER_PORT=12345 python distributed_example.py --rank 0 --world-size 2 --backend nccl
Output:
Master node 10.81.13.54:12345. Rank 0. World size: 2
srs-ds11:64570:64570 [0] NCCL INFO Bootstrap : Using [0]team0:10.81.13.54<0>
srs-ds11:64570:64570 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
srs-ds11:64570:64570 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
srs-ds11:64570:64570 [0] NCCL INFO NET/Socket : Using [0]team0:10.81.13.54<0>
NCCL version 2.4.8+cuda10.1
srs-ds11:64570:65266 [0] NCCL INFO Setting affinity for GPU 0 to 55,55555555
Node 2
Input:
NCCL_SOCKET_IFNAME=team0 NCCL_DEBUG=INFO MASTER_ADDR=10.81.13.54 MASTER_PORT=12345 python distributed_example.py --rank 1 --world-size 2 --backend nccl
Output:
Master node 10.81.13.54:12345. Rank 1. World size: 2
srs-ds8:192240:192240 [0] NCCL INFO Bootstrap : Using [0]team0:10.81.13.51<0>
srs-ds8:192240:192240 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
srs-ds8:192240:192240 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
srs-ds8:192240:192240 [0] NCCL INFO NET/Socket : Using [0]team0:10.81.13.51<0>
srs-ds8:192240:192316 [0] NCCL INFO Setting affinity for GPU 0 to 55,55555555
srs-ds8:192240:192316 [0] include/socket.h:390 NCCL WARN Connect to 10.81.13.54<34419> failed : No route to host
srs-ds8:192240:192316 [0] NCCL INFO bootstrap.cc:100 -> 2
srs-ds8:192240:192316 [0] NCCL INFO bootstrap.cc:326 -> 2
srs-ds8:192240:192316 [0] NCCL INFO init.cc:695 -> 2
srs-ds8:192240:192316 [0] NCCL INFO init.cc:951 -> 2
srs-ds8:192240:192316 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
Traceback (most recent call last):
  File "distributed_example.py", line 17, in <module>
    dist.all_reduce(test_tensor, op=dist.ReduceOp.SUM)
  File "/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 936, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629411241/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled system error, NCCL version 2.4.8
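This looks like the same root cause as the Gloo failure: NCCL's bootstrap socket on rank 0 listens on a dynamically chosen port (34419 in the WARN line above, again not MASTER_PORT), and rank 1 gets "No route to host" trying to reach it. A quick check from srs-ds8 while rank 0 is still waiting (the port changes on every run, so take it from your own log):

import socket

# 34419 is the bootstrap port from the NCCL WARN line above; adjust per run.
try:
    socket.create_connection(("10.81.13.54", 34419), timeout=5)
    print("bootstrap port reachable")
except OSError as e:
    print(f"blocked: {e}")

If this is blocked, opening TCP traffic between the two nodes (not just MASTER_PORT) should fix both the Gloo and NCCL runs.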