There are two machines, A and B, each running one container: ContainerX on A and ContainerY on B. I set node A (ContainerX) as the master.
Their IP addresses are:
A: 192.168.1.1, X: 10.10.1.1
B: 192.168.2.2, Y: 10.10.2.2
I also added port forwarding: connections to port 1234 on each host are forwarded to port 1234 in its container.
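A quick way to confirm that the forwarding rule actually works is to open a plain TCP connection to the forwarded port from the other side. The snippet below is only a minimal sketch using the standard library; `can_connect` is a hypothetical helper name, and it assumes something is already listening on port 1234 inside ContainerX (for example, the node A torchrun process waiting for peers).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False


# Example (run inside ContainerY, addresses taken from the post):
#   can_connect("192.168.1.1", 1234)  # host A, relies on the port forwarding
#   can_connect("10.10.1.1", 1234)    # ContainerX directly
```

If the host address is reachable but the container address is not, that would suggest the workers on node B can only reach the master through the forwarded port on host A.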
I want to run the most basic multi-node distributed parallel job.
The dist init code is as follows:
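import os
import torch

# torchrun exports RANK, WORLD_SIZE and LOCAL_RANK for every worker process.
args.rank = int(os.environ["RANK"])
args.world_size = int(os.environ["WORLD_SIZE"])
args.local_rank = int(os.environ["LOCAL_RANK"])
args.dist_backend = "nccl"
# With no init_method given, the default env:// rendezvous reads MASTER_ADDR and MASTER_PORT.
torch.distributed.init_process_group(backend=args.dist_backend,
                                     world_size=args.world_size, rank=args.rank)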
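When debugging a setup like this, it can help to print what each worker actually received from torchrun before calling init_process_group. The following is a small sketch, not part of the original script; `read_dist_env` and `DIST_ENV_KEYS` are hypothetical names, while the environment variables themselves are the ones torchrun exports to its workers.

```python
import os

# Environment variables that torchrun exports to each worker process.
DIST_ENV_KEYS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK")


def read_dist_env(env=None):
    """Collect the rendezvous settings a worker sees, failing loudly if any is missing."""
    env = os.environ if env is None else env
    missing = [key for key in DIST_ENV_KEYS if key not in env]
    if missing:
        raise KeyError(f"missing distributed environment variables: {missing}")
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "rank": int(env["RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
        "local_rank": int(env["LOCAL_RANK"]),
    }


# Example: call print(read_dist_env()) at the top of ddp.py on both nodes
# and compare the master_addr / master_port values that each node reports.
```

Comparing the printed values on node A and node B shows directly whether both sides are trying to rendezvous at the same address and port.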
The bash commands are:
On node A (ContainerX):
NCCL_DEBUG=INFO torchrun \
--nnodes=2 \
--node_rank=0 \
--nproc_per_node=8 \
--master_port=1234 \
--master_addr=10.10.1.1 \
ddp.py
On node B (ContainerY):
NCCL_DEBUG=INFO torchrun \
--nnodes=2 \
--node_rank=1 \
--nproc_per_node=8 \
--master_port=1234 \
--master_addr=192.168.2.2 \
ddp.py
But it does not work.
I know that using --net=host can solve the problem, but some other container tools (such as LXD) do not support it, so I would like to know a more elegant way.
Is opening only port 1234 not enough?
I have also tried modifying MASTER_ADDR in the bash command (changing it to 192.168.1.1), but it still does not work.