Hi,

What is the difference between `torch.distributed.all_gather` and `torch.distributed.all_gather_multigpu`? They both have the same definition: "Gathers tensors from the whole group in a list."
But `torch.distributed.all_gather_multigpu` has a different use case: the `*_multigpu` variants are supposed to work for multi-node setups with multiple GPUs per node. Then again, `all_gather` should also work across multiple nodes…

This is the example provided in the docs for the `*_multigpu` variants, with 2 nodes, using `all_reduce_multigpu` (but they all work the same way):
node 0:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl",
                        init_method="file:///distributed_test",
                        world_size=2,
                        rank=0)
tensor_list = []
for dev_idx in range(torch.cuda.device_count()):
    tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx))

dist.all_reduce_multigpu(tensor_list)
```
node 1:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl",
                        init_method="file:///distributed_test",
                        world_size=2,
                        rank=1)
tensor_list = []
for dev_idx in range(torch.cuda.device_count()):
    tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx))

dist.all_reduce_multigpu(tensor_list)
```
What use case is this? And most importantly, why would a single process on a node with multiple GPUs want access to all the devices? In this example, it is creating a tensor on each device. That does not look like a real use case…
I assume the intent here is to duplicate the tensor we want to synch on each device of the local node. In `all_gather_multigpu`, the output is a list of size `world_size * nbr_gpus_in_node`, so after calling the synch function, each process holds the same output list, containing the tensor from every GPU in the world.
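To make the shapes concrete, here is a minimal sketch of how I understand the call (assuming, as in the doc example, one process per node driving all local GPUs, that `init_process_group` was already called as above, and that every node has the same number of GPUs; variable names like `n_gpus` are mine):

```python
import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) was already called, one process per node
n_gpus = torch.cuda.device_count()
world_size = dist.get_world_size()  # number of processes, i.e. number of nodes here

# one input tensor per local GPU
input_tensor_list = [torch.FloatTensor([dist.get_rank()]).cuda(i)
                     for i in range(n_gpus)]

# on EACH local GPU, preallocate room for every tensor of every GPU in the world,
# so the gathered result ends up duplicated n_gpus times per node
output_tensor_lists = [
    [torch.zeros(1).cuda(i) for _ in range(world_size * n_gpus)]
    for i in range(n_gpus)
]

dist.all_gather_multigpu(output_tensor_lists, input_tensor_list)
```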
Why do we need `all_gather_multigpu` when `all_gather` can already do this easily, without duplicating the tensor on every GPU of a node?
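For comparison, here is a minimal sketch of the same gather with plain `all_gather`, assuming the usual one-process-per-GPU setup (so `world_size` is the total number of GPUs across all nodes, and a launcher sets the usual environment variables, including `LOCAL_RANK`):

```python
import os

import torch
import torch.distributed as dist

# assumes one process per GPU and a launcher that sets RANK, WORLD_SIZE, LOCAL_RANK
dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

tensor = torch.FloatTensor([dist.get_rank()]).cuda()
gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, tensor)
# each rank now holds one copy of every GPU's tensor, with no per-node duplication
```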
Also, the downside of `all_gather_multigpu` is that it requires that EACH NODE HAS THE SAME NUMBER OF GPUS. In practice, this is unlikely on shared clusters: with SLURM, you can request 8 GPUs and end up with several of them on the same node while the rest are dispatched over 4 nodes with 1 GPU per node…
An example of how to synch a tensor across all GPUs would be very welcome.

Thanks!