Hi, I’m trying to use DistributedDataParallel to enable training on multiple GPUs on a single node. From what I’ve read in the online docs, the procedure is roughly the following (please correct me if I’m wrong):
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend=backend, init_method=init_method, rank=rank, world_size=world_size)
model = DistributedDataParallel(model, device_ids=[local_rank])
What should I set these parameters to, e.g. backend, init_method, rank, world_size? I couldn’t find an example showing how to set them in detail.
I also saw some examples online that used DistributedSampler. Is it necessary for using DistributedDataParallel?
Your params (see the sketch after this list):
- backend selects the communication library used between processes; it depends on your hardware and setup, e.g. backend='nccl' for NVIDIA GPUs, or 'gloo' for CPU training.
- init_method tells the processes how to find each other, e.g. a TCP address like 'tcp://127.0.0.1:23456', or 'env://' to read MASTER_ADDR/MASTER_PORT from environment variables.
- local_rank is the process index that torch.multiprocessing.spawn() passes as the first argument to your function (fn).
- rank is the global index of the process, computed from the node rank and the local rank (rank = node_rank * gpus_per_node + local_rank); on a single node, rank == local_rank.
- world_size is how many processes you want to run in total (typically number of nodes * number of GPUs per node).
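Putting this together, here is a minimal single-node sketch. Names like run_worker and MyModel, and the port 23456, are illustrative placeholders, not from any particular example:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel

def run_worker(local_rank, world_size):
    # Single node: the global rank equals the local rank. With multiple
    # nodes it would be node_rank * gpus_per_node + local_rank.
    rank = local_rank
    dist.init_process_group(
        backend="nccl",                       # "gloo" if you train on CPU
        init_method="tcp://127.0.0.1:23456",  # any free port on this machine
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(local_rank)
    model = MyModel().to(local_rank)  # MyModel is a placeholder for your model
    model = DistributedDataParallel(model, device_ids=[local_rank])
    # ... training loop goes here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # one process per GPU
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)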
DistributedSampler partitions the dataset across the ranks you’ve created with multiprocessing.spawn(), so each process trains on a different shard of the data rather than all ranks seeing the same batches. It isn’t strictly required by DistributedDataParallel, but without it every rank would iterate over the full dataset and you’d lose the data-parallel speedup.
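As a rough sketch (dataset and num_epochs are placeholders for your own Dataset and training config), each rank would build its DataLoader like this inside the spawned function:

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset)  # picks up rank/world_size from the process group
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # so shuffling differs across epochs
    for batch in loader:
        ...  # forward/backward as usual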