Hi, I’m trying to use DistributedDataParallel to enable training on multiple GPUs on a single node. From what I’ve read in the online docs, the procedure is roughly the following (please correct me if I’m wrong):
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend=backend, init_method=init_method, rank=rank, world_size=world_size)
model = DistributedDataParallel(model, device_ids=[local_rank])
What should I set these parameters to, e.g. backend, init_method, rank, world_size? I couldn’t find an example showing how to set them in detail.
I also saw some examples online that used DistributedSampler. Is it necessary for using DistributedDataParallel?
Your params (see the sketch after this list):
- backend selects the communication library used between processes; it depends on your hardware and setup, e.g. backend='nccl' for NVIDIA GPUs, or 'gloo' for CPU training.
- init_method tells the processes how to find each other, e.g. a TCP address like 'tcp://127.0.0.1:23456', or 'env://' to read MASTER_ADDR/MASTER_PORT from environment variables.
- local_rank is the process index that torch.multiprocessing.spawn() passes as the first argument to your function (fn).
- rank is the global index of the process, computed from the node rank and the local rank (rank = node_rank * gpus_per_node + local_rank); on a single node, rank == local_rank.
- world_size is how many processes you want to run in total (typically number of nodes * number of GPUs per node).
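Putting this together, here is a minimal single-node sketch. Names like run_worker and MyModel, and the port 23456, are illustrative placeholders, not from any particular example:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel

def run_worker(local_rank, world_size):
    # Single node: the global rank equals the local rank. With multiple
    # nodes it would be node_rank * gpus_per_node + local_rank.
    rank = local_rank
    dist.init_process_group(
        backend="nccl",                       # "gloo" if you train on CPU
        init_method="tcp://127.0.0.1:23456",  # any free port on this machine
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(local_rank)
    model = MyModel().to(local_rank)  # MyModel is a placeholder for your model
    model = DistributedDataParallel(model, device_ids=[local_rank])
    # ... training loop goes here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # one process per GPU
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)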
DistributedSampler partitions the dataset across the ranks you’ve created with multiprocessing.spawn(), so each process trains on a different shard of the data rather than all ranks seeing the same batches. It isn’t strictly required by DistributedDataParallel, but without it every rank would iterate over the full dataset and you’d lose the data-parallel speedup.
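As a rough sketch (dataset and num_epochs are placeholders for your own Dataset and training config), each rank would build its DataLoader like this inside the spawned function:

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset)  # picks up rank/world_size from the process group
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # so shuffling differs across epochs
    for batch in loader:
        ...  # forward/backward as usual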