Collective communication across nodes

Hi there,

I was wondering if there is a way to do an all_reduce between two GPUs on different nodes. For example, if I have a tensor on GPU0 of machine 0 and another tensor on GPU0 of machine 1, is it possible to issue a dist.all_reduce call across the nodes using the NCCL backend?

The following code hangs:

import argparse
import torch
import torch.distributed as dist

def ProcessArgs():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int)
    return parser.parse_args()

def MultiNodeTest(global_rank, local_rank):
    torch.cuda.set_device(local_rank)
    t = torch.ones(2, 2).to(local_rank)
    dist.all_reduce(t)  # this is the call that hangs with NCCL

if __name__ == '__main__':
    args = ProcessArgs()
    dist.init_process_group(backend='nccl', init_method='env://')
    MultiNodeTest(dist.get_rank(), args.local_rank)

I am launching this simple script using the following:

On Node 0:

python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr= --master_port=12347

On Node 1:

python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr= --master_port=12347

This code runs just fine when I use gloo as my backend. It only deadlocks when I use NCCL. Any ideas what could be happening?

is it possible to issue a dist.all_reduce call across the nodes using the NCCL backend?

This is definitely supported.

Your code looks correct to me, per the launch utility tutorial.
Just a guess: have you tried specifying the world_size arg in init_process_group?

Or have you tried using TCP as the init_method? Something like init_method="tcp://{}:{}".format(args.master_addr, args.master_port)
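Putting both suggestions together, the init call would look roughly like the sketch below. This is a single-process snippet using the gloo backend purely so it runs anywhere; on the actual two-node setup the backend would be 'nccl', world_size would be 2, rank would come from the launcher, and 127.0.0.1:29500 is a placeholder for the real master address/port:

```python
import torch
import torch.distributed as dist

# Sketch: explicit TCP init_method plus explicit world_size/rank.
# Single gloo process so the snippet is runnable standalone; the real
# setup would use backend='nccl' with world_size=2 across the nodes.
dist.init_process_group(
    backend='gloo',
    init_method='tcp://127.0.0.1:29500',  # placeholder addr/port
    world_size=1,  # passing world_size explicitly, as suggested above
    rank=0,
)
t = torch.ones(2, 2)
dist.all_reduce(t)  # sum over a single rank leaves the tensor unchanged
```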

@mrshenli Any idea?

I found a workaround: creating a new group object containing all the ranks and then passing that group to all comm calls. If I do this, it doesn't hang anymore. Not sure why using group.WORLD as the group in the comm calls hangs. Is this a known issue / am I missing something?
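For reference, that workaround looks roughly like this. Again a sketch with a single gloo process so it runs standalone; the actual setup used nccl across two nodes:

```python
import torch
import torch.distributed as dist

# Workaround pattern: build an explicit group over all ranks and pass
# it to every collective, instead of relying on the implicit default
# (WORLD) group. Shown single-process with gloo; the real setup used
# backend='nccl' with two nodes.
dist.init_process_group(backend='gloo',
                        init_method='tcp://127.0.0.1:29501',
                        world_size=1, rank=0)
all_ranks = list(range(dist.get_world_size()))
group = dist.new_group(ranks=all_ranks)  # explicit group over all ranks
t = torch.ones(2, 2)
dist.all_reduce(t, group=group)  # pass the explicit group to the collective
```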

If you still need to investigate further, try setting the NCCL environment variable NCCL_SOCKET_IFNAME (probably to "eth1") for debugging.
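For example, before launching the script on each node (assuming the relevant interface really is eth1; check with ip addr or ifconfig on each machine):

```shell
# Pin NCCL to a specific network interface (name is machine-specific).
export NCCL_SOCKET_IFNAME=eth1
# Optional: verbose NCCL logging to see where initialization stalls.
export NCCL_DEBUG=INFO
```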