Collective communication across nodes

Hi there,

I was wondering if there is a way to do an all_reduce between two GPUs on different nodes. For example, if I have a tensor on GPU0 of machine 0 and another tensor on GPU0 of machine 1, is it possible to issue a dist.all_reduce call across the nodes using the NCCL backend?

The following code hangs:

import argparse
import torch
import torch.distributed as dist

def ProcessArgs():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int)
    return parser.parse_args()

def MultiNodeTest(global_rank, local_rank):
    torch.cuda.set_device(local_rank)
    t = torch.ones(2, 2).to(local_rank)
    dist.all_reduce(t)  # this is the call that hangs with NCCL

if __name__ == '__main__':
    args = ProcessArgs()
    dist.init_process_group(backend='nccl', init_method='env://')
    MultiNodeTest(dist.get_rank(), args.local_rank)

I am launching this simple script using the following:

On Node 0:

python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr= --master_port=12347

On Node 1:

python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr= --master_port=12347

This code runs just fine when I use gloo as my backend. It only deadlocks when I use NCCL. Any ideas what could be happening?

is it possible to issue a dist.all_reduce call across the nodes using the NCCL backend?

This is definitely supported.

Your code looks correct to me, per the launch utility tutorial.
Just a guess: have you tried specifying the world_size arg in init_process_group?

Or have you tried using TCP as the init_method? Something like init_method="tcp://{}:{}".format(args.master_addr, args.master_port)
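Putting both suggestions together, the init call would look roughly like the sketch below. This is a single-process snippet using the gloo backend purely so it runs anywhere; on the actual two-node setup the backend would be 'nccl', world_size would be 2, rank would come from the launcher, and 127.0.0.1:29500 is a placeholder for the real master address/port:

```python
import torch
import torch.distributed as dist

# Sketch: explicit TCP init_method plus explicit world_size/rank.
# Single gloo process so the snippet is runnable standalone; the real
# setup would use backend='nccl' with world_size=2 across the nodes.
dist.init_process_group(
    backend='gloo',
    init_method='tcp://127.0.0.1:29500',  # placeholder addr/port
    world_size=1,  # passing world_size explicitly, as suggested above
    rank=0,
)
t = torch.ones(2, 2)
dist.all_reduce(t)  # sum over a single rank leaves the tensor unchanged
```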

@mrshenli Any idea?

I found a workaround: creating a new group object containing all the ranks and then passing that group to all comm calls. If I do this, it doesn't hang anymore. Not sure why using group.WORLD as the group in the comm calls hangs. Is this a known issue / am I missing something?
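For reference, that workaround looks roughly like this. Again a sketch with a single gloo process so it runs standalone; the actual setup used nccl across two nodes:

```python
import torch
import torch.distributed as dist

# Workaround pattern: build an explicit group over all ranks and pass
# it to every collective, instead of relying on the implicit default
# (WORLD) group. Shown single-process with gloo; the real setup used
# backend='nccl' with two nodes.
dist.init_process_group(backend='gloo',
                        init_method='tcp://127.0.0.1:29501',
                        world_size=1, rank=0)
all_ranks = list(range(dist.get_world_size()))
group = dist.new_group(ranks=all_ranks)  # explicit group over all ranks
t = torch.ones(2, 2)
dist.all_reduce(t, group=group)  # pass the explicit group to the collective
```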

If you still need to investigate further, try setting the NCCL environment variable NCCL_SOCKET_IFNAME (probably to "eth1") for debugging.
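For example, before launching the script on each node (assuming the relevant interface really is eth1; check with ip addr or ifconfig on each machine):

```shell
# Pin NCCL to a specific network interface (name is machine-specific).
export NCCL_SOCKET_IFNAME=eth1
# Optional: verbose NCCL logging to see where initialization stalls.
export NCCL_DEBUG=INFO
```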