I am trying to launch a DDP training job on 2 GPUs that sit on 2 different nodes interconnected via LAN. I would like to use mpirun instead of torchrun; what changes should I make?
I already adapted the script to read the MPI environment variables like this:
# from mpirun
import os
import torch.distributed as dist

print("Launching with mpirun")
# Open MPI exposes each process's rank and the world size via OMPI_COMM_WORLD_* variables
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])  # GPU index on this node
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
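Since init_process_group with the default env:// rendezvous reads the master address from each process's own environment, I also added a sanity print just before the call (a minimal check I wrote myself; "<missing>" is just a placeholder I chose):

# Every rank must see the same rendezvous address for init_process_group to succeed
print(f"[rank {rank}] MASTER_ADDR={os.environ.get('MASTER_ADDR', '<missing>')} "
      f"MASTER_PORT={os.environ.get('MASTER_PORT', '<missing>')}")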
I used to launch the training without problems with the following command, run from both nodes, only changing --node_rank from 0 to 1 on the slave:
torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=xxx.xxx.xxx.xxx --master_port=1234 train_torch.py --train_args
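So on the slave the command is identical except for the rank:

torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=xxx.xxx.xxx.xxx --master_port=1234 train_torch.py --train_args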
But when I try to launch from the master with mpirun like this:
export MASTER_ADDR=xxx.xxx.xxx.xxx
export MASTER_PORT=1234
mpirun -np 2 -host xxx.xxx.xxx.xxx,xxx.xxx.xxx.xxy --map-by node python train_torch.py
It just hangs and never even starts parsing the script.
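One thing I am unsure about: does mpirun even forward MASTER_ADDR and MASTER_PORT to the process on the second node? If it does not, I assume I would need to export them explicitly with Open MPI's -x flag, something like this (untested sketch, same placeholder addresses as above):

export MASTER_ADDR=xxx.xxx.xxx.xxx
export MASTER_PORT=1234
# -x tells Open MPI to forward the named environment variables to all ranks, including remote ones
mpirun -np 2 -host xxx.xxx.xxx.xxx,xxx.xxx.xxx.xxy --map-by node \
    -x MASTER_ADDR -x MASTER_PORT \
    python train_torch.py

Is that what is missing, or is there something else I should change?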