Should the parameter nproc_per_node be equal on two different GPU nodes

slyviacassell · January 14, 2021, 9:14am

I have two GPU nodes. One has two GPUs and the other has only one. I want to use them together for distributed training, and I launch the job with these commands:
Node 1

python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="0.0.0.0" --master_port=3338 train_dist.py  --restore 0 --config-file configs/vgg16_nddr_additive_4_unpool_aug_shortcut_sing_cosine_dist.yaml

Node 2

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="0.0.0.0" --master_port=3338 train_dist.py  --restore 0 --config-file configs/vgg16_nddr_additive_4_unpool_aug_shortcut_sing_cosine_dist.yaml

But the program hangs, and I have no idea why. When nproc_per_node is set to 1 on both nodes, there is no problem. So, must the number of GPUs on each distributed node always be the same? Are there any other ways to run distributed training on unbalanced nodes?

pritamdamania87 · January 15, 2021, 3:17am

This is currently a limitation of torch.distributed.launch, which assumes all nodes are symmetric. On each node it computes world_size as nproc_per_node * nnodes, so with your commands the first node expects a world size of 4 and the second expects 2. Because this value is not consistent across nodes, you see the hang.
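To make the mismatch concrete, here is a small sketch of the arithmetic for the two commands in the question. The world-size formula is the one stated above; the global-rank formula (nproc_per_node * node_rank + local_rank) is an assumption about how the launcher of that era numbered processes and is not stated in this thread.

```python
def assumed_world_size(nproc_per_node: int, nnodes: int) -> int:
    # Symmetric assumption described above: every node runs the same number of processes.
    return nproc_per_node * nnodes


def assumed_global_ranks(nproc_per_node: int, node_rank: int) -> list:
    # Assumption (not from the thread): rank = nproc_per_node * node_rank + local_rank.
    return [nproc_per_node * node_rank + local for local in range(nproc_per_node)]


NNODES = 2
# node_rank -> value passed as --nproc_per_node in the question
launch_args = {0: 2, 1: 1}

view = {
    node: (assumed_world_size(nproc, NNODES), assumed_global_ranks(nproc, node))
    for node, nproc in launch_args.items()
}
actual_processes = sum(launch_args.values())

for node, (world_size, ranks) in view.items():
    print(f"node {node}: world_size={world_size}, ranks={ranks}")
print(f"processes actually started: {actual_processes}")
```

Under these assumptions the nodes disagree about world_size (4 versus 2), only 3 processes ever start, and rank 1 would be claimed on both nodes, so the rendezvous can never complete.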

slyviacassell · January 15, 2021, 3:31am

Thanks a lot! I solved it with torch.multiprocessing.spawn. Another option would be to re-implement launch.py.
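The thread does not show the code for this workaround, so the following is only a minimal sketch of how it might look. It assumes each node knows the full per-node process layout (PROCS_PER_NODE), that NODE_RANK, MASTER_ADDR and MASTER_PORT are set as environment variables on each node, and that the NCCL backend is used; all of these names and choices are hypothetical, not taken from the original post.

```python
import os

# Hypothetical layout from the question: node 0 runs 2 processes, node 1 runs 1.
PROCS_PER_NODE = [2, 1]


def rank_layout(procs_per_node, node_rank):
    """Return (world_size, first_global_rank) for the given node."""
    world_size = sum(procs_per_node)
    first_rank = sum(procs_per_node[:node_rank])
    return world_size, first_rank


def worker(local_rank, node_rank, master_addr, master_port):
    # Imported here so the rank arithmetic above can be checked without PyTorch.
    import torch
    import torch.distributed as dist

    world_size, first_rank = rank_layout(PROCS_PER_NODE, node_rank)
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_addr}:{master_port}",
        rank=first_rank + local_rank,  # explicit global rank, no symmetry assumed
        world_size=world_size,         # identical on every node
    )
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()


def main():
    import torch.multiprocessing as mp

    node_rank = int(os.environ["NODE_RANK"])
    mp.spawn(
        worker,
        args=(node_rank, os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"]),
        nprocs=PROCS_PER_NODE[node_rank],
    )


# Launch only when the per-node variables are present (hypothetical convention).
if __name__ == "__main__" and "NODE_RANK" in os.environ:
    main()
```

The point of the sketch is that world_size and each global rank are passed explicitly to init_process_group, so every process agrees on a world size of 3 regardless of how many GPUs its own node has.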

logicShu · September 14, 2021, 1:59am

Perhaps this answer should be revised?
I read the source code and found that this has now been changed.

github.com

pytorch/pytorch/blob/master/torch/distributed/elastic/agent/server/api.py#L580

    
      
        start_idx: int = 0,
        end_idx: int = -1,
    ) -> Tuple[int, List[int]]:
        if end_idx == -1:
            end_idx = len(role_infos)
        prefix_sum = 0
        total_sum = 0
        for idx in range(start_idx, end_idx):
            if role_idx > idx:
                prefix_sum += role_infos[idx].local_world_size
            total_sum += role_infos[idx].local_world_size
        return (
            total_sum,
            list(range(prefix_sum, prefix_sum + role_infos[role_idx].local_world_size)),
        )


    # pyre-fixme[56]: Pyre was not able to infer the type of the decorator
    #  `torch.distributed.elastic.metrics.prof`.
    @prof
    def _assign_worker_ranks(
        self, store, group_rank: int, group_world_size: int, spec: WorkerSpec
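To see why the quoted logic handles unequal nodes, here is a standalone sketch of the same prefix-sum idea applied to the two nodes from the original question. RoleInfo and ranks_for are hypothetical stand-ins written for this illustration; they are not the names or signatures used in the PyTorch source, and the start_idx/end_idx window is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RoleInfo:
    # Hypothetical stand-in: only the field used by the quoted code is modeled.
    local_world_size: int


def ranks_for(role_infos: List[RoleInfo], role_idx: int) -> Tuple[int, List[int]]:
    # Workers on earlier nodes come first, so their count is this node's rank offset.
    prefix_sum = sum(info.local_world_size for info in role_infos[:role_idx])
    # The world size is the sum over all nodes, not nproc_per_node * nnodes.
    total_sum = sum(info.local_world_size for info in role_infos)
    own = role_infos[role_idx].local_world_size
    return total_sum, list(range(prefix_sum, prefix_sum + own))


nodes = [RoleInfo(2), RoleInfo(1)]  # node 0 has 2 GPUs, node 1 has 1
for idx in range(len(nodes)):
    print(idx, ranks_for(nodes, idx))
# 0 (3, [0, 1])
# 1 (3, [2])
```

Both nodes arrive at the same world size of 3 and receive non-overlapping global ranks, which is what the symmetric nproc_per_node * nnodes calculation could not provide.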