Hi, I want to launch 4 processes (two processes per node) on a distributed-memory system.
Each node in the system has 2 GPUs
So, the layout is the following:
Node 1
rank 0 on GPU:0
rank 1 on GPU:1
Node 2
rank 2 on GPU:0
rank 3 on GPU:1
I am trying to use the torch.distributed.launch utility from the PyTorch documentation.
I am using Singularity containerization and mpiexec in a script, in the following way.
First I do:
qsub -n 2 -t 5 -A myproject ./Script.sh
which asks for 2 nodes for 5 minutes.
Inside the script we run the following command:
mpiexec -n 4 -f $COBALT_NODEFILE singularity exec --nv -B $mscoco_path $centos_path python3.8 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=$myrank --master_addr="192.168.1.1" --master_port=1234 $cl_path --b 128 -t -v $mscoco_path
How do I get the $myrank environment variable so I can pass it to --node_rank as stipulated in the documentation?
Thanks!