Should the nproc_per_node parameter be equal on two different GPU nodes?

I have two GPU nodes: one has two GPUs and the other has only one. I want to use both for distributed training, and I launch it with the following commands:

Node 1

python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="0.0.0.0" --master_port=3338 train_dist.py  --restore 0 --config-file configs/vgg16_nddr_additive_4_unpool_aug_shortcut_sing_cosine_dist.yaml

Node 2

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="0.0.0.0" --master_port=3338 train_dist.py  --restore 0 --config-file configs/vgg16_nddr_additive_4_unpool_aug_shortcut_sing_cosine_dist.yaml

But the program hangs, and I have no idea why. When nproc_per_node is set to 1 on both nodes, there is no problem. So, does the number of GPUs on each distributed node always have to be the same? Are there any other ways to run unbalanced distributed training?

This is currently a limitation of torch.distributed.launch, which assumes all nodes are symmetric. On each node it computes the world_size as nproc_per_node * nnodes, so with the commands above node 1 expects 4 processes and node 2 expects 2, while only 3 are actually started. Because the value is not consistent across nodes, you see the hang.
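To make the mismatch concrete, here is a small sketch of the arithmetic described above. The helper name assumed_world_size is made up for illustration and is not part of PyTorch; it only mirrors the nproc_per_node * nnodes formula.

```python
# Hypothetical helper mirroring the formula the launcher applies on each node.
def assumed_world_size(nproc_per_node, nnodes):
    return nproc_per_node * nnodes

# Values taken from the two commands in the question.
node1_view = assumed_world_size(nproc_per_node=2, nnodes=2)  # what node 1 expects
node2_view = assumed_world_size(nproc_per_node=1, nnodes=2)  # what node 2 expects
actual_processes = 2 + 1  # processes that are really started across both nodes

print(node1_view, node2_view, actual_processes)
```

The three numbers disagree, so the processes never agree on how many peers to wait for.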

Thanks a lot! I successfully solved it with torch.multiprocessing.spawn. Another option would be to re-implement launch.py.
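For anyone landing here later, below is a rough sketch of what a spawn-based setup could look like. It is not the code I actually used: the GPUS_PER_NODE list, the helper names, the NCCL backend, and the tcp:// init method are all assumptions for illustration. The idea is to pass the true world size and a global rank to init_process_group explicitly instead of letting the launcher derive them.

```python
# Assumed layout: number of GPUs on each node, indexed by node rank.
GPUS_PER_NODE = [2, 1]

def world_size(gpus_per_node=GPUS_PER_NODE):
    # Total number of processes across all nodes.
    return sum(gpus_per_node)

def global_rank(node_rank, local_rank, gpus_per_node=GPUS_PER_NODE):
    # Offset the local rank by the number of GPUs on all lower-ranked nodes.
    return sum(gpus_per_node[:node_rank]) + local_rank

def worker(local_rank, node_rank, master_addr, master_port):
    # Imported here so the rank helpers above can be used without torch.
    import torch
    import torch.distributed as dist

    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",  # assumption: GPU training with NCCL
        init_method=f"tcp://{master_addr}:{master_port}",
        world_size=world_size(),
        rank=global_rank(node_rank, local_rank),
    )
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

def main(node_rank, master_addr, master_port):
    import torch.multiprocessing as mp

    # spawn calls worker(i, *args) for i in range(nprocs).
    mp.spawn(
        worker,
        args=(node_rank, master_addr, master_port),
        nprocs=GPUS_PER_NODE[node_rank],
    )
```

Each node would call main with its own node_rank (0 on the two-GPU node, 1 on the one-GPU node) and the same master address and port, so every process sees a world size of 3.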