I have 2 processes on one node, and I want to set up two different communication channels between them: one with gloo to exchange tiny messages on CPU, and one with nccl to exchange large tensors for computation. Is it possible? When I call `dist.init_process_group` twice, once with `backend="gloo"` and once with `backend="nccl"`, I get an error saying that I cannot initialize it twice.
This is what I tried, but it does not work:

```python
import torch.distributed as dist
import torch

dist.init_process_group(backend="gloo")

group1 = dist.new_group(backend="gloo")
data1 = torch.tensor([1] * dist.get_rank())
dist.all_reduce(data1, group=group1)
print(f"answer from rank {dist.get_rank()}: avg {data1.mean().cpu().item()}")

group2 = dist.new_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())
data2 = torch.ones((1024, 1024, 1024), device=f"cuda:{dist.get_rank()}")
dist.all_reduce(data2, group=group2)
print(f"answer from rank {dist.get_rank()}: avg {data2.mean().cpu().item()}")
```
It prints out `answer from rank 0: avg nan`, and then hangs.
It’s quite strange that this code works:

```python
import torch.distributed as dist
import torch

dist.init_process_group(backend="gloo")

group2 = dist.new_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())
data2 = torch.ones((1024, 1024, 1024), device=f"cuda:{dist.get_rank()}")
dist.all_reduce(data2, group=group2)
print(f"answer from rank {dist.get_rank()}: avg {data2.mean().cpu().item()}")
```
But this code does not:

```python
import torch.distributed as dist
import torch

dist.init_process_group(backend="gloo")

group1 = dist.new_group(backend="gloo")
data1 = torch.tensor([1] * dist.get_rank())
dist.all_reduce(data1, group=group1)
print(f"answer from rank {dist.get_rank()}: avg {data1.mean().cpu().item()}")
```
Can I create arbitrary groups an arbitrary number of times? I’m afraid that users might call `dist.init_process_group` themselves, and I don’t want to interfere with them. Therefore, I want to create a process group with the gloo backend no matter what users do to the `torch.distributed` module.
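This requirement can be sketched as a small helper (a hypothetical `get_cpu_group`, not part of `torch.distributed`): it initializes the default group only if nobody has, then creates a gloo-backed subgroup. As the snippets above show, `dist.new_group` accepts a `backend` that differs from the default group's, so the subgroup can be gloo even when the user initialized with nccl.

```python
import torch.distributed as dist


def get_cpu_group():
    """Hypothetical helper: return a gloo-backed group for small CPU
    messages, regardless of which backend the user initialized with."""
    if not dist.is_initialized():
        # Nobody has called init_process_group yet; fall back to gloo.
        dist.init_process_group(backend="gloo")
    # new_group accepts a backend different from the default group's,
    # so this also works when the default group is nccl.
    return dist.new_group(backend="gloo")
```

Since `dist.new_group` is a collective call that must be made by all ranks and allocates a new group each time, it is safer to call such a helper once and cache the result rather than creating a fresh group per message.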
Oh, there is a bug in my code: `torch.tensor([1] * dist.get_rank())` should be `torch.tensor([1.0 * dist.get_rank()])`. After fixing the bug, the code works fine.
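To see why the buggy version printed `nan` and then hung: on rank 0, `[1] * dist.get_rank()` is `[1] * 0 == []`, an empty tensor, while rank 1 builds a one-element tensor, so the two ranks pass mismatched tensors to `all_reduce`. A minimal single-process sketch of the two constructions:

```python
import math
import torch

# Rank 0's construction: [1] * 0 == [], so the tensor is empty and its
# mean is nan — matching the "avg nan" that rank 0 printed.
empty = torch.tensor([1] * 0)
assert empty.numel() == 0
assert math.isnan(empty.mean().item())

# Rank 1's construction: a one-element *integer* tensor. The ranks
# therefore call all_reduce with differently-shaped tensors, which
# would explain the hang.
rank1 = torch.tensor([1] * 1)
assert rank1.numel() == 1 and rank1.dtype == torch.int64

# The fix: one float element per rank, same shape and dtype everywhere.
fixed = torch.tensor([1.0 * 0])
assert fixed.numel() == 1
assert fixed.mean().item() == 0.0
```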