I have two GPU nodes. One has two GPUs and the other has only one GPU. I want to use them for distributed training, so I launch with the following commands:
Node 1
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="0.0.0.0" --master_port=3338 train_dist.py --restore 0 --config-file configs/vgg16_nddr_additive_4_unpool_aug_shortcut_sing_cosine_dist.yaml
Node 2
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="0.0.0.0" --master_port=3338 train_dist.py --restore 0 --config-file configs/vgg16_nddr_additive_4_unpool_aug_shortcut_sing_cosine_dist.yaml
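For context, the process-group setup in train_dist.py follows the standard env:// pattern that torch.distributed.launch drives. This is a simplified sketch, not the exact contents of the script:

```python
# Minimal sketch of the launcher-facing init (simplified; the real train_dist.py does more).
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch injects --local_rank for each spawned process
parser.add_argument("--local_rank", type=int, default=0)
args, _ = parser.parse_known_args()

torch.cuda.set_device(args.local_rank)
# env:// reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE set by the launcher;
# with the commands above, WORLD_SIZE should be 3 (2 processes on node 0 + 1 on node 1).
dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()} / world_size {dist.get_world_size()} ready")
```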
But the program hangs, and I have no idea why. When nproc_per_node is set to 1 on both nodes, everything works fine. So, does the number of GPUs (processes) on each distributed node always have to be the same? Is there any other way to run distributed training with an unbalanced number of GPUs per node?