Collective communication across nodes

Hi there,

I was wondering if there is a way to do an all_reduce between two GPUs on different nodes. For example, if I have a tensor on GPU0 of machine 0 and another tensor on GPU0 of machine 1, is it possible to issue a dist.all_reduce call across the nodes using the NCCL backend?

The following code hangs:

import argparse
import torch
import torch.distributed as dist

def ProcessArgs():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int)
    return parser.parse_args()

def MultiNodeTest(global_rank, local_rank):
    torch.cuda.set_device(local_rank)
    t = torch.ones(2, 2).to(local_rank)
    dist.all_reduce(t)  # this is the call that hangs with NCCL

if __name__ == '__main__':
    args = ProcessArgs()
    dist.init_process_group(backend='nccl', init_method='env://')
    MultiNodeTest(dist.get_rank(), args.local_rank)

I am launching this simple script using the following:

On Node 0:

python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr= --master_port=12347

On Node 1:

python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr= --master_port=12347

This code runs just fine when I use gloo as my backend. It only deadlocks when I use NCCL. Any ideas what could be happening?

is it possible to issue a dist.all_reduce call across the nodes using the NCCL backend?

This is definitely supported.

Your code looks correct to me, per the launch utility tutorial.
Just a guess: have you tried specifying the world_size arg in init_process_group?

Or have you tried using TCP as the init_method? Something like init_method="tcp://{}:{}".format(args.master_addr, args.master_port)
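Putting both suggestions together, the init call would look roughly like the sketch below. This is a single-process snippet using the gloo backend purely so it runs anywhere; on the actual two-node setup the backend would be 'nccl', world_size would be 2, rank would come from the launcher, and 127.0.0.1:29500 is a placeholder for the real master address/port:

```python
import torch
import torch.distributed as dist

# Sketch: explicit TCP init_method plus explicit world_size/rank.
# Single gloo process so the snippet is runnable standalone; the real
# setup would use backend='nccl' with world_size=2 across the nodes.
dist.init_process_group(
    backend='gloo',
    init_method='tcp://127.0.0.1:29500',  # placeholder addr/port
    world_size=1,  # passing world_size explicitly, as suggested above
    rank=0,
)
t = torch.ones(2, 2)
dist.all_reduce(t)  # sum over a single rank leaves the tensor unchanged
```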

@mrshenli Any idea?

I found a workaround: creating a new group object containing all the ranks and then passing that group to all comm calls. If I do this, it doesn't hang anymore. Not sure why using group.WORLD as the group in the comm calls hangs. Is this a known issue / am I missing something?
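For reference, that workaround looks roughly like this. Again a sketch with a single gloo process so it runs standalone; the actual setup used nccl across two nodes:

```python
import torch
import torch.distributed as dist

# Workaround pattern: build an explicit group over all ranks and pass
# it to every collective, instead of relying on the implicit default
# (WORLD) group. Shown single-process with gloo; the real setup used
# backend='nccl' with two nodes.
dist.init_process_group(backend='gloo',
                        init_method='tcp://127.0.0.1:29501',
                        world_size=1, rank=0)
all_ranks = list(range(dist.get_world_size()))
group = dist.new_group(ranks=all_ranks)  # explicit group over all ranks
t = torch.ones(2, 2)
dist.all_reduce(t, group=group)  # pass the explicit group to the collective
```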

If you still need to investigate further, try setting the NCCL environment variable NCCL_SOCKET_IFNAME (probably to "eth1") for debugging.
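For example, before launching the script on each node (assuming the relevant interface really is eth1; check with ip addr or ifconfig on each machine):

```shell
# Pin NCCL to a specific network interface (name is machine-specific).
export NCCL_SOCKET_IFNAME=eth1
# Optional: verbose NCCL logging to see where initialization stalls.
export NCCL_DEBUG=INFO
```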