DDP: diff between dist.all_gather and dist.all_gather_multigpu?

hi,
what is the difference between torch.distributed.all_gather and torch.distributed.all_gather_multigpu?
they both have the same description in the docs: "Gathers tensors from the whole group in a list."
but torch.distributed.all_gather_multigpu seems to have a different use case…
*_multigpu is supposed to work across multiple nodes, yet all_gather also works across multiple nodes…
this is the example provided in the docs for the *_multigpu functions, with 2 nodes and all_reduce_multigpu, but they all work the same way.

node 0

import torch
import torch.distributed as dist

# rank 0: the single process driving all GPUs on the first node
dist.init_process_group(backend="nccl",
                        init_method="file:///distributed_test",
                        world_size=2,
                        rank=0)

# one tensor per local GPU, each holding the value to be reduced
tensor_list = []
for dev_idx in range(torch.cuda.device_count()):
    tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx))

# sums the tensors across every GPU on both nodes, in place
dist.all_reduce_multigpu(tensor_list)

node 1:

import torch
import torch.distributed as dist

# rank 1: identical to rank 0 except for the rank argument
dist.init_process_group(backend="nccl",
                        init_method="file:///distributed_test",
                        world_size=2,
                        rank=1)

tensor_list = []
for dev_idx in range(torch.cuda.device_count()):
    tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx))

dist.all_reduce_multigpu(tensor_list)

what use case is this?
and most importantly, why would a process on a node with multiple gpus want to access all the devices? in this example it is creating a tensor on each device; this does not look like a realistic case…

i assume the intent is to duplicate the tensor we want to sync on each device of the local node.
in all_gather_multigpu, the output is a list of size world_size * nbr_gpus_in_node.
so, when the sync function is called, each process ends up with the same output list, which contains the tensor from every gpu in the world.

why do we need all_gather_multigpu when all_gather can already do this easily, without duplicating the tensor across local gpus?
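for comparison, here is a minimal sketch of doing the same sync with plain all_gather and one process per gpu (it assumes a launcher such as torchrun has set RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT; the tensor value is only an illustration):

import os
import torch
import torch.distributed as dist

# one process per gpu: every rank owns exactly one device
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
local_rank = int(os.environ["LOCAL_RANK"])

dist.init_process_group(backend="nccl", init_method="env://",
                        world_size=world_size, rank=rank)
torch.cuda.set_device(local_rank)

tensor = torch.ones(1, device="cuda")                        # tensor to sync on this rank
gathered = [torch.empty_like(tensor) for _ in range(world_size)]
dist.all_gather(gathered, tensor)                            # every rank gets the same list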

also, the downside of all_gather_multigpu is that it requires that EACH NODE HAS THE SAME NUMBER OF GPUS. in practice, this is unlikely on shared clusters. in slurm, you can request 8 gpus and get some of them on the same node while the rest are dispatched over 4 nodes with 1 gpu per node…

an example of how to sync a tensor across all gpus.

thanks


With all_gather, a single tensor is broadcast from each process. With all_gather_multigpu, a list of tensors is broadcast from each process.
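Roughly, the call shapes look like this (a sketch, assuming the process group is already initialized with the NCCL backend; the tensor contents are placeholders):

import torch
import torch.distributed as dist

world_size = dist.get_world_size()

# all_gather: each process contributes exactly one tensor,
# and the output list has world_size entries
t = torch.ones(1, device="cuda")
out = [torch.empty_like(t) for _ in range(world_size)]
dist.all_gather(out, t)

# all_gather_multigpu: each process contributes one tensor per local GPU,
# and each inner output list has world_size * len(input_list) entries,
# living on the same GPU as the matching input tensor
input_list = [torch.ones(1).cuda(i) for i in range(torch.cuda.device_count())]
output_lists = [[torch.empty_like(inp) for _ in range(world_size * len(input_list))]
                for inp in input_list]
dist.all_gather_multigpu(output_lists, input_list)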

what use case is this?
and most importantly, why would a process on a node with multiple gpus want to access all the devices? in this example it is creating a tensor on each device; this does not look like a realistic case…

It's an example of using the PyTorch API.

also, the downside of all_gather_multigpu is that it requires that EACH NODE HAS THE SAME NUMBER OF GPUS. in practice, this is unlikely on shared clusters. in slurm, you can request 8 gpus and get some of them on the same node while the rest are dispatched over 4 nodes with 1 gpu per node…

You would simply need to configure your resources correctly.

not sure about that.
they are both used to sync one tensor.
the output of both methods is a list of tensors where each tensor comes from a process.
then, you can merge this list however you like to get the final synced tensor, as in here.
broadcasting is done with torch.distributed.broadcast, i think: it copies a tensor from the source and diffuses it to the rest. all_gather copies the tensor from every process into a list and makes sure that all processes end up with the exact same list.
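a small sketch of that distinction (it assumes the process group is already initialized with one process per gpu; the values are placeholders):

import torch
import torch.distributed as dist

rank, world_size = dist.get_rank(), dist.get_world_size()

# broadcast: the tensor on the src rank overwrites the copy on every other rank
t = torch.full((4,), float(rank), device="cuda")
dist.broadcast(t, src=0)                                 # t now holds rank 0's values everywhere

# all_gather: every rank contributes its own tensor and receives everyone's,
# then the list can be merged however you like
x = torch.full((4,), float(rank), device="cuda")
gathered = [torch.empty_like(x) for _ in range(world_size)]
dist.all_gather(gathered, x)
merged = torch.cat(gathered)                             # e.g. cat, or torch.stack(gathered).sum(0)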

they are both used to sync one tensor.

all_gather https://github.com/pytorch/pytorch/blob/101a6263309ae2f9e52947c7d02d630e1190b6c3/torch/distributed/distributed_c10d.py#L1940

all_gather_multigpu https://github.com/pytorch/pytorch/blob/101a6263309ae2f9e52947c7d02d630e1190b6c3/torch/distributed/distributed_c10d.py#L1450

I think the multigpu versions are needed when you run a single process for all the GPUs on a node. Most commonly we run one process per GPU, in which case dist.all_gather is sufficient to gather tensors from different GPUs on different nodes. If we run only one process for all the GPUs on a given node (as in the example code at Distributed communication package - torch.distributed — PyTorch 1.11.0 documentation), we need dist.all_gather_multigpu to aggregate data from all the GPUs, because in that case each rank has 8 GPUs under it.
You can see in that example code that world_size is 2, which means there are only two processes, one on each node, covering the 16 GPUs. In a more typical implementation, we would create 16 processes in total (8 on each node) using torch.multiprocessing.spawn().
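A rough sketch of that one-process-per-GPU setup (the rendezvous address, the 8-GPUs-per-node layout, and the worker function are only illustrative):

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

GPUS_PER_NODE = 8      # illustrative: matches the 2-node x 8-GPU example above

def worker(local_rank, node_rank, world_size):
    # one process per GPU: the global rank combines the node index and the local GPU index
    rank = node_rank * GPUS_PER_NODE + local_rank
    dist.init_process_group(backend="nccl",
                            init_method="tcp://10.1.1.20:23456",   # placeholder address/port
                            world_size=world_size,
                            rank=rank)
    torch.cuda.set_device(local_rank)

    tensor = torch.ones(1, device="cuda")
    gathered = [torch.empty_like(tensor) for _ in range(world_size)]
    dist.all_gather(gathered, tensor)      # plain all_gather is enough with one process per GPU
    dist.destroy_process_group()

if __name__ == "__main__":
    node_rank = 0                          # 0 on the first node, 1 on the second
    mp.spawn(worker, args=(node_rank, 2 * GPUS_PER_NODE), nprocs=GPUS_PER_NODE)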