I am trying to launch a DDP training job on 2 GPUs that sit on 2 different nodes interconnected via LAN. I would like to use mpirun instead of torchrun; what changes should I make?
I already adapted the script to read the MPI environment variables like this:
# from mpirun
import os
import torch.distributed as dist

print("Launching with mpirun")
# Open MPI exposes each process's rank and the world size via OMPI_COMM_WORLD_* variables
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])  # GPU index on this node
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
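Since init_process_group with the default env:// rendezvous reads the master address from each process's own environment, I also added a sanity print just before the call (a minimal check I wrote myself; "<missing>" is just a placeholder I chose):

# Every rank must see the same rendezvous address for init_process_group to succeed
print(f"[rank {rank}] MASTER_ADDR={os.environ.get('MASTER_ADDR', '<missing>')} "
      f"MASTER_PORT={os.environ.get('MASTER_PORT', '<missing>')}")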
I used to launch the training without problems with the following command, run from both nodes, only changing --node_rank from 0 to 1 on the slave:
torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=xxx.xxx.xxx.xxx --master_port=1234 train_torch.py --train_args
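So on the slave the command is identical except for the rank:

torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=xxx.xxx.xxx.xxx --master_port=1234 train_torch.py --train_args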
But when I try to launch from the master with mpirun like this:
export MASTER_ADDR=xxx.xxx.xxx.xxx
export MASTER_PORT=1234
mpirun -np 2 -host xxx.xxx.xxx.xxx,xxx.xxx.xxx.xxy --map-by node python train_torch.py
It just hangs and never even starts parsing the script.
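One thing I am unsure about: does mpirun even forward MASTER_ADDR and MASTER_PORT to the process on the second node? If it does not, I assume I would need to export them explicitly with Open MPI's -x flag, something like this (untested sketch, same placeholder addresses as above):

export MASTER_ADDR=xxx.xxx.xxx.xxx
export MASTER_PORT=1234
# -x tells Open MPI to forward the named environment variables to all ranks, including remote ones
mpirun -np 2 -host xxx.xxx.xxx.xxx,xxx.xxx.xxx.xxy --map-by node \
    -x MASTER_ADDR -x MASTER_PORT \
    python train_torch.py

Is that what is missing, or is there something else I should change?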