Launching with mpirun instead of torchrun

I am trying to launch DDP training on 2 GPUs that sit on 2 different nodes connected over a LAN. I would like to use mpirun instead of torchrun; what changes do I need to make?

I already changed the env variables like this:

import os

import torch
import torch.distributed as dist

# from mpirun: map Open MPI's variables onto the names PyTorch expects
print("Launching with mpirun")
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])

torch.cuda.set_device(local_rank)  # bind this process to its own GPU
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
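Note that even when rank and world_size are passed explicitly, the default env:// rendezvous still reads MASTER_ADDR and MASTER_PORT from the environment, so every rank needs them set before init_process_group. A minimal sketch, with the master's address hard-coded purely for illustration:

import os

# Illustrative values only; every rank (not just rank 0) must see the same
# MASTER_ADDR/MASTER_PORT for the default env:// rendezvous to work
os.environ.setdefault("MASTER_ADDR", "xxx.xxx.xxx.xxx")
os.environ.setdefault("MASTER_PORT", "1234")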

I used to launch the training with no problems using the following command, run on both nodes with --node_rank changed from 0 to 1 on the second node:

torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=xxx.xxx.xxx.xxx --master_port=1234 train_torch.py --train_args
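For context, torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT to every worker itself, while mpirun only sets the OMPI_* variables, which is why the remapping above is needed. A small sketch that handles either launcher (untested, just to show the idea):

import os

# Prefer Open MPI's variables when present, otherwise fall back to torchrun's
if "OMPI_COMM_WORLD_RANK" in os.environ:
    rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
else:
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])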

But when I try to launch from the master with mpirun like this:

export MASTER_ADDR=xxx.xxx.xxx.xxx
export MASTER_PORT=1234
mpirun -np 2 -host xxx.xxx.xxx.xxx,xxx.xxx.xxx.xxy --map-by node python train_torch.py

It just hangs and never even starts parsing the script.
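One suspicion: by default Open MPI does not forward local exports to the processes it starts on the other host, so MASTER_ADDR and MASTER_PORT may simply be unset on the second node, which would make init_process_group block forever. Forwarding them explicitly with -x should rule this out:

mpirun -np 2 -host xxx.xxx.xxx.xxx,xxx.xxx.xxx.xxy --map-by node \
    -x MASTER_ADDR -x MASTER_PORT \
    python train_torch.py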

Knowing where it’s hanging will help figure out the solution.

Try running with the env var TORCH_DISTRIBUTED_DEBUG set to INFO or DETAIL.
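Since you’re launching through mpirun, the variable also has to reach the remote rank; with Open MPI, -x accepts an inline assignment, something like:

mpirun -np 2 -host xxx.xxx.xxx.xxx,xxx.xxx.xxx.xxy --map-by node \
    -x TORCH_DISTRIBUTED_DEBUG=DETAIL \
    python train_torch.py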

It gives no information, as it gets stuck right after printing “No Protocol specified”.
It seems to be waiting for the second node to do something, but I don’t know how to solve it; it should work by just launching on the master. I’m facing the exact same problem when trying to use DeepSpeed.
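As a sanity check (assuming plain Open MPI commands work on this cluster), running something trivial across both hosts first would tell whether the hang is in MPI itself or in PyTorch:

# Should print both hostnames; a hang here means mpirun cannot reach node 2
mpirun -np 2 -host xxx.xxx.xxx.xxx,xxx.xxx.xxx.xxy --map-by node hostname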