Use of torch.distributed.launch on a distributed multi-node system

Hi, I want to launch 4 processes (two processes per node) on a distributed-memory system.
Each node in the system has 2 GPUs.
So, the layout is the following:
Node 1
rank 0 on GPU:0
rank 1 on GPU:1
Node 2
rank 2 on GPU:0
rank 3 on GPU:1

I am trying to use this from the PyTorch documentation

I am using singularity containerization and mpiexec in a script in the following way:

First I do:

qsub -n 2 -t 5 -A myproject ./Script.sh

which asks for 2 nodes for 5 minutes.

Inside the script we have the following command:

mpiexec -n 4 -f $COBALT_NODEFILE singularity exec --nv -B $mscoco_path $centos_path python3.8 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=$myrank --master_addr="192.168.1.1" --master_port=1234 $cl_path --b 128 -t -v $mscoco_path

How do I get the $myrank env variable in order to provide it to --node_rank, as stipulated in the documentation?

Thanks!

Hey @dariodematties, is it possible to know which qsub node (node1 or node2) Script.sh is running on? If so, the node rank for the launch script can be derived from the node id accordingly.
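
For example, something along these lines might work inside Script.sh to derive it on each node (an untested sketch, assuming $COBALT_NODEFILE lists one hostname per line):

# Untested sketch: derive the node rank from this node's position in the Cobalt node file.
# Assumes the entries in $COBALT_NODEFILE match socket.gethostname() (on some systems they
# may be fully-qualified names instead).
import os
import socket

with open(os.environ["COBALT_NODEFILE"]) as f:
    nodes = [line.strip() for line in f if line.strip()]

node_rank = nodes.index(socket.gethostname())
print(node_rank)  # pass this value to --node_rank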

cc @Kiuk_Chung, just in case you have seen use cases like this before. :slight_smile:

@mrshenli IIUC singularity’s qsub command is invoked from the submitting process. @dariodematties I’m assuming you are using dist.init_process_group(backend="mpi") inside the script at $cl_path?

Since you are already launching with mpiexec there is no need to wrap your script with the distributed launcher. See: https://github.com/pytorch/pytorch/blob/49f0e5dfeb64a928c6a2368dd5f86573b07d20fb/torch/distributed/distributed_c10d.py#L446

The rank and world size information is provided by the MPI runtime.
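
With the MPI backend, the script itself would only need something like this (a minimal sketch, not your actual $cl_path script; note the MPI backend requires a PyTorch build with MPI support):

# Minimal sketch: with backend="mpi" the rank and world size come from the MPI runtime,
# so no launcher, MASTER_ADDR/MASTER_PORT, or other env variables are needed here.
import torch.distributed as dist

dist.init_process_group(backend="mpi")
rank = dist.get_rank()
world_size = dist.get_world_size()
local_rank = rank % 2  # assumption: 2 ranks per node, assigned node by node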


Thanks @mrshenli, I have tried to get the node id from Script.sh, but I have not been able to do that so far

Hi @Kiuk_Chung, thank you for your help. I think what you say makes a lot of sense.
I tried what you suggested; this is the mpiexec line in Script.sh:

mpiexec -n 4 -f $COBALT_NODEFILE -ppn 2 singularity exec --nv -B $mscoco_path $centos_path python3.8 $cl_path --b 128 -t -v $mscoco_path

I try to launch 4 processes on two nodes (-ppn stands for processes per node)
mscoco_path is a path that gets bind-mounted into the container (-B)
centos_path is the path of the container image
cl_path is the path to the Python script I want to run
what follows are options for that script
I also used dist.init_process_group(backend="mpi") as you suggested
Yet, it is not working

As far as I can see in the output, it launches the container but never launches the Python script

@dariodematties In mpiexec -n 4 -f $COBALT_NODEFILE -ppn 2, what do the -n 4 and -ppn 2 do? My understanding was that mpiexec -n 4 -f $COBALT_NODEFILE is going to invoke 4 procs on each host listed by $COBALT_NODEFILE.

Thank you very much for your response @Kiuk_Chung

-n 4 means 4 processes in total and -ppn 2 means 2 processes per node

I have been using this extensively in MPI+OpenMP hybrid applications in C and C++ code, and this is the behavior

mpiexec distributes the 4 processes across the nodes you asked for from qsub, using -ppn
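
If it helps, a quick way to double-check that placement (assuming mpi4py is available inside the container) is to run a tiny probe script with the same mpiexec line:

# probe.py - hypothetical helper: print which host each MPI rank lands on
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()} on {MPI.Get_processor_name()}")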

apologies for the late reply, circling back on this, were you able to get it working?

All good @Kiuk_Chung

We are all very busy, in fact

I have not been able to solve it

Running on a single node, I realized that it was not necessary to use mpiexec in front of torch.distributed.launch

The correct command to use in the script is the following


singularity exec --nv -B $imagenet_path $centos_path python3.8 -m torch.distributed.launch --nproc_per_node=2 $re_path --b 256 -f 2 -v $best_model_path $imagenet_path

Here I specify 2 processes per node since this node has two GPUs (--nproc_per_node=2)
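
For reference, the per-process side of this setup looks roughly like the sketch below (the actual script at $re_path is not shown here; torch.distributed.launch passes a --local_rank argument to each process it spawns):

# Rough sketch of what each launched process does; the real script ($re_path) may differ.
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args, _ = parser.parse_known_args()

dist.init_process_group(backend="gloo", init_method="env://")  # RANK/WORLD_SIZE come from the launcher
torch.cuda.set_device(args.local_rank)  # bind this process to GPU:0 or GPU:1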

I launch the script using the following line on the command line


qsub -n 1 -t 720 -A my-project-name ./run_Active_SimCLR.sh

Here I am asking for a single node (-n 1) for 12 hours (-t 720)

Remember that I have this running inside a container

As the documentation points out, when I want to run the same code on two nodes I use the following:

singularity exec --nv -B $mscoco_path $centos_path python3.8 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" --master_port=1234 $cl_path --b 128 -t -v $mscoco_path &
singularity exec --nv -B $mscoco_path $centos_path python3.8 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" --master_port=1234 $cl_path --b 128 -t -v $mscoco_path

The new options here that the documentation specifies are the number of nodes (--nnodes=2) and the node id (--node_rank=0 and --node_rank=1, respectively)
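
Under the hood, with init_method='env://' each spawned process reads its rendezvous information from environment variables that torch.distributed.launch sets; roughly (an illustration only, not the launcher's actual code):

# Illustration only: the environment that init_method="env://" reads,
# as torch.distributed.launch would set it for one process on node_rank=1.
import os

os.environ["MASTER_ADDR"] = "192.168.1.1"
os.environ["MASTER_PORT"] = "1234"
os.environ["WORLD_SIZE"] = "4"   # nnodes * nproc_per_node
os.environ["RANK"] = "2"         # node_rank * nproc_per_node + local_rank (here local_rank=0)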

And I launch it by

qsub -n 2 -t 720 -A my-project-name ./run_Active_SimCLR.sh

Where I ask for 2 nodes for 12 hours

Using print statements, I realized that the code gets stuck at this line:

torch.distributed.init_process_group(backend='gloo', init_method='env://')

When I check the nodes using qstat they are still listed as running, but they seem to be idle.
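
One way to narrow this down might be to print the rendezvous environment in each process and fail fast instead of hanging (a rough sketch; GLOO_SOCKET_IFNAME is only needed if gloo picks the wrong network interface, and the right interface name depends on the cluster):

# Debugging sketch: show what each process is trying to rendezvous with, and time out
# instead of hanging indefinitely if the two nodes cannot reach each other.
import datetime
import os
import torch.distributed as dist

for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
    print(key, os.environ.get(key), flush=True)

# os.environ["GLOO_SOCKET_IFNAME"] = "eth0"  # assumption: the interface that routes between the nodes
dist.init_process_group(backend="gloo", init_method="env://",
                        timeout=datetime.timedelta(minutes=5))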