Setting Parameters of DistributedDataParallel

Hi, I’m trying to use DistributedDataParallel to enable training on multiple GPUs on a single node. According to what I read in the online documentation, the procedure is the following (please correct me if I’m wrong):

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend=backend, init_method=init_method, rank=rank, world_size=world_size)
model = DistributedDataParallel(model, device_ids=[local_rank])

And then spawn the process in the main function:

import torch.multiprocessing as multiprocessing
multiprocessing.spawn(fn, args=args, nprocs=world_size, join=True)

I have two questions regarding this:

  • What should I set these parameters to? e.g., backend, init_method, rank, world_size… I didn’t find an example showing details of setting these parameters.

  • I saw some online examples that use DistributedSampler. Is it necessary when using DistributedDataParallel?

Thanks!!

Hi,

You can take a look at https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Your params:

  • backend is the communication backend used between processes (it depends on your hardware and setup), e.g. backend = 'nccl' for NVIDIA GPUs or 'gloo' for CPU.
  • local_rank is passed to your function (fn) as its first argument by multiprocessing.spawn().
  • rank is computed from the node rank and the local rank (on a single node, rank == local_rank).
  • world_size is the total number of ranks you want to run, typically number of nodes * number of GPUs per node.
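
To make that concrete, here is a minimal sketch of the per-process setup on a single node. This is just one common choice, not the only one: the address/port values are arbitrary placeholders, env:// is one of several init methods, and local_rank is assumed to be the process index that spawn() passes to your fn.

import os
import torch.distributed as dist

# Address/port of the rank-0 process; on a single node, localhost plus any free port works.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl",       # 'nccl' for CUDA GPUs, 'gloo' for CPU
                        init_method="env://", # read MASTER_ADDR / MASTER_PORT from the environment
                        rank=local_rank,      # == local_rank when there is only one node
                        world_size=4)         # e.g. 1 node * 4 GPUs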

DistributedSampler is useful for loading data across the ranks you’ve created with multiprocessing.spawn(): it partitions the dataset so the ranks don’t train on the same pieces of data.
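
For example, assuming you already have a Dataset instance (dataset, batch_size=32, and num_epochs below are placeholders, not anything specific to your setup), the sampler is wired in roughly like this:

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset)                          # shards the dataset across ranks
loader = DataLoader(dataset, batch_size=32, sampler=sampler)   # don't combine shuffle=True with a sampler

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)    # so each epoch gets a different shuffle across ranks
    for batch in loader:
        ...                     # usual forward / backward / optimizer step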

Hi, thanks for your reply! I’m still a bit confused about the concept of rank. Does it correspond to the number of GPUs available?

For example, if I’m using 1 node with 4 GPUs to train, how should I set those parameters?

You can think of rank as the global ID of a process that controls one (preferred) or more GPUs; local_rank is the ID of that process within its node.

If you have 1 node with 4 GPUs, the preferred setup is DDP with 4 ranks (processes), one per GPU.
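
Putting the pieces together for the 1-node / 4-GPU case, a rough end-to-end sketch could look like the following. The toy Linear model, random TensorDataset, master address/port, batch size, and learning rate are all placeholders just to keep the example self-contained; substitute your own model, data, and training loop.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def train(local_rank, world_size):
    # One process per GPU; on a single node, rank == local_rank.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    # Toy model and data, only to make the sketch runnable as-is.
    model = DistributedDataParallel(torch.nn.Linear(10, 1).cuda(local_rank),
                                    device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
    sampler = DistributedSampler(dataset)        # each rank gets its own shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(3):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()                      # DDP averages gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4                               # 1 node * 4 GPUs
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)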
