How to run torch DDP model with multiple gpus in one machine?

For example, I have 2 GPUs in one machine. When using DistributedDataParallel, I need to call init_process_group. In the TORCH.DISTRIBUTED docs I found an example like the one below:

For example, suppose the system we use for distributed training has 2 nodes, each of which has 8 GPUs. On each of the 16 GPUs, there is a tensor that we would like to all-reduce. The following code can serve as a reference:

Code running on Node 0

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl",
                        init_method="file:///distributed_test",
                        world_size=2,
                        rank=0)
tensor_list = []
for dev_idx in range(torch.cuda.device_count()):
    tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx))

dist.all_reduce_multigpu(tensor_list)
Code running on Node 1

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl",
                        init_method="file:///distributed_test",
                        world_size=2,
                        rank=1)
tensor_list = []
for dev_idx in range(torch.cuda.device_count()):
    tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx))

dist.all_reduce_multigpu(tensor_list)

Following this example, I think that if I have only one node, I should set world_size=1 and rank=0 in init_process_group. That seems to work, because I can see both GPUs at about 100% usage. But the collective functions like all_reduce/all_reduce_multigpu do not seem to work properly.

If instead I set world_size=2 and rank=local_rank (the argument passed by torch.distributed.launch) in init_process_group, the collective functions work well and I can gather tensors from the different GPUs. But the GPU usage is very low and loss.backward() runs very slowly; GPU usage sometimes drops to 0% while loss.backward() is running.

The environment is CUDA 10.2, PyTorch 1.7.1, on Windows with the gloo backend.
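For reference, the second setup looks roughly like this in each process launched by torch.distributed.launch (a simplified sketch with a placeholder model, not my exact training code):

import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torch.distributed.launch passes --local_rank to each process it spawns
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# one process per GPU: world_size=2, rank follows local_rank
dist.init_process_group(backend="gloo",
                        init_method="file:///distributed_test",
                        world_size=2,
                        rank=args.local_rank)

torch.cuda.set_device(args.local_rank)
model = torch.nn.Linear(10, 10).cuda(args.local_rank)  # placeholder model
ddp_model = DDP(model, device_ids=[args.local_rank])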

So which setting is right: world_size=1 with rank=0, or world_size=2 with the rank set from local_rank? If the first one is right, how can I gather tensor information from the different GPUs?

Thanks a lot.

Hi, world_size = 1 and rank = 0 wouldn't really work for distributed training, as we generally want to train with more than 1 GPU (i.e. world size > 1). In particular, the number of GPUs (not necessarily the number of nodes) is usually used as the world size.
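For a single machine with 2 GPUs, that means two processes, each driving one GPU, and both calling init_process_group with world_size=2 and their own rank. A minimal sketch of setup/teardown helpers, reusing the file:// init_method and gloo backend from your snippet (function names are just illustrative):

import torch
import torch.distributed as dist

def setup(rank, world_size):
    # one process per GPU; world_size == number of GPUs on the machine
    dist.init_process_group(backend="gloo",
                            init_method="file:///distributed_test",
                            rank=rank,
                            world_size=world_size)
    # pin this process to its own GPU
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()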

Regarding slowness in loss.backward(), can you provide a repro of that (the code snippet above appears to be just an allreduce)? In general, loss.backward() will trigger additional allreduces during the backward pass to synchronize parameter gradients, but especially for 2 GPUs we don't expect this to add significant overhead.
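For illustration, once the model is wrapped in DistributedDataParallel, that gradient allreduce happens inside loss.backward() itself; a toy step like the following (placeholder model and random data), run in each process after the process group is initialized, shows where the synchronization occurs:

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(rank):
    # assumes init_process_group and torch.cuda.set_device(rank) already ran in this process
    model = nn.Linear(10, 10).cuda(rank)        # placeholder model
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)

    inputs = torch.randn(20, 10).cuda(rank)     # placeholder batch
    labels = torch.randn(20, 10).cuda(rank)

    optimizer.zero_grad()
    loss = nn.MSELoss()(ddp_model(inputs), labels)
    loss.backward()   # DDP allreduces parameter gradients across processes here
    optimizer.step()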