I have 2 GPUs in one machine, for example. When using DistributedDataParallel, I need to call init_process_group. In the TORCH.DISTRIBUTED docs I found an example like the one below:
For example, suppose the system we use for distributed training has 2 nodes, each of which has 8 GPUs. On each of the 16 GPUs, there is a tensor that we would like to all-reduce. The following code can serve as a reference:
Code running on Node 0
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl",
                        init_method="file:///distributed_test",
                        world_size=2,
                        rank=0)
tensor_list = []
for dev_idx in range(torch.cuda.device_count()):
tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx))
dist.all_reduce_multigpu(tensor_list)
Code running on Node 1
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl",
                        init_method="file:///distributed_test",
                        world_size=2,
                        rank=1)
tensor_list = []
for dev_idx in range(torch.cuda.device_count()):
tensor_list.append(torch.FloatTensor([1]).cuda(dev_idx))
dist.all_reduce_multigpu(tensor_list)
Following this example, I think that if I have only one node, I should set world_size=1 and rank=0 in init_process_group. It seems to work in the sense that I can see both GPUs at about 100% usage, but none of the collective functions, such as all_reduce/all_reduce_multigpu, seem to work properly.
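For reference, this is roughly what my single-node initialization looks like in the first case (the file path is just a placeholder on my machine). As far as I understand, with world_size=1 the process group contains only this one process, so an all_reduce has nothing else to reduce with:

import torch
import torch.distributed as dist

# Single-process group: only this process participates in collectives.
dist.init_process_group(backend="gloo",
                        init_method="file:///d:/tmp/shared_init_file",  # placeholder path
                        world_size=1,
                        rank=0)

t = torch.ones(1)
dist.all_reduce(t)
print(t)  # still tensor([1.]) -- there is no second process to reduce with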
If I instead set world_size=2 and rank=local_rank (the argument provided by torch.distributed.launch) in init_process_group, the collective functions work correctly and I can all-reduce tensors across the different GPUs. However, GPU usage is very low and loss.backward() runs very slowly; the usage sometimes drops to 0% while loss.backward() is running.
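This is roughly the per-process setup I use in the second case, launched with python -m torch.distributed.launch --nproc_per_node=2 train.py (the file path and the toy model are placeholders):

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

# On Windows, torch 1.7.1 only supports the gloo backend with file:// initialization.
dist.init_process_group(backend="gloo",
                        init_method="file:///d:/tmp/shared_init_file",  # placeholder path
                        world_size=2,
                        rank=args.local_rank)

torch.cuda.set_device(args.local_rank)  # one process per GPU

model = torch.nn.Linear(10, 10).cuda(args.local_rank)
model = DDP(model, device_ids=[args.local_rank])

# Collectives do work across the two processes here:
t = torch.ones(1).cuda(args.local_rank)
dist.all_reduce(t)  # t becomes 2.0 on both ranks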
The environment is CUDA 10.2, torch 1.7.1, on Windows with the gloo backend.
So which setting is right: world_size=1 with rank=0, or world_size=2 with the rank taken from local_rank? If the first one is right, how can I gather/reduce tensor information across the different GPUs?
Thanks a lot.