I have 2 processes on one node, and I want to set up two different communication channels between them: one with gloo to exchange tiny messages on CPU, and one with nccl to exchange large tensors for computation. Is it possible? When I call `dist.init_process_group` twice, once with `backend="gloo"` and once with `backend="nccl"`, I get an error saying that I cannot initialize it twice.
This is what I tried, but it does not work:

```python
import torch.distributed as dist
import torch

dist.init_process_group(backend="gloo")

group1 = dist.new_group(backend="gloo")
data1 = torch.tensor([1] * dist.get_rank())
dist.all_reduce(data1, group=group1)
print(f"answer from rank {dist.get_rank()}: avg {data1.mean().cpu().item()}")

group2 = dist.new_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())
data2 = torch.ones((1024, 1024, 1024), device=f"cuda:{dist.get_rank()}")
dist.all_reduce(data2, group=group2)
print(f"answer from rank {dist.get_rank()}: avg {data2.mean().cpu().item()}")
```
It prints out `answer from rank 0: avg nan`, and then hangs.
It’s quite strange that this code works:

```python
import torch.distributed as dist
import torch

dist.init_process_group(backend="gloo")

group2 = dist.new_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())
data2 = torch.ones((1024, 1024, 1024), device=f"cuda:{dist.get_rank()}")
dist.all_reduce(data2, group=group2)
print(f"answer from rank {dist.get_rank()}: avg {data2.mean().cpu().item()}")
```
But this code does not:

```python
import torch.distributed as dist
import torch

dist.init_process_group(backend="gloo")

group1 = dist.new_group(backend="gloo")
data1 = torch.tensor([1] * dist.get_rank())
dist.all_reduce(data1, group=group1)
print(f"answer from rank {dist.get_rank()}: avg {data1.mean().cpu().item()}")
```
Can I create arbitrary groups an arbitrary number of times? I’m afraid that users might call `dist.init_process_group` themselves, and I don’t want to interfere with them. Therefore, I want to create a process group with the gloo backend no matter what users do to the `torch.distributed` module.
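This requirement can be sketched as a small helper (a hypothetical `get_cpu_group`, not part of `torch.distributed`): it initializes the default group only if nobody has, then creates a gloo-backed subgroup. As the snippets above show, `dist.new_group` accepts a `backend` that differs from the default group's, so the subgroup can be gloo even when the user initialized with nccl.

```python
import torch.distributed as dist


def get_cpu_group():
    """Hypothetical helper: return a gloo-backed group for small CPU
    messages, regardless of which backend the user initialized with."""
    if not dist.is_initialized():
        # Nobody has called init_process_group yet; fall back to gloo.
        dist.init_process_group(backend="gloo")
    # new_group accepts a backend different from the default group's,
    # so this also works when the default group is nccl.
    return dist.new_group(backend="gloo")
```

Since `dist.new_group` is a collective call that must be made by all ranks and allocates a new group each time, it is safer to call such a helper once and cache the result rather than creating a fresh group per message.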
Oh, there is a bug in my code: `torch.tensor([1] * dist.get_rank())` should be `torch.tensor([1.0 * dist.get_rank()])`. After fixing the bug, the code works fine.
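To see why the buggy version printed `nan` and then hung: on rank 0, `[1] * dist.get_rank()` is `[1] * 0 == []`, an empty tensor, while rank 1 builds a one-element tensor, so the two ranks pass mismatched tensors to `all_reduce`. A minimal single-process sketch of the two constructions:

```python
import math
import torch

# Rank 0's construction: [1] * 0 == [], so the tensor is empty and its
# mean is nan — matching the "avg nan" that rank 0 printed.
empty = torch.tensor([1] * 0)
assert empty.numel() == 0
assert math.isnan(empty.mean().item())

# Rank 1's construction: a one-element *integer* tensor. The ranks
# therefore call all_reduce with differently-shaped tensors, which
# would explain the hang.
rank1 = torch.tensor([1] * 1)
assert rank1.numel() == 1 and rank1.dtype == torch.int64

# The fix: one float element per rank, same shape and dtype everywhere.
fixed = torch.tensor([1.0 * 0])
assert fixed.numel() == 1
assert fixed.mean().item() == 0.0
```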