In PyTorch's official documentation on DDP, it says to set things up as follows:
```python
def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
```
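For context, my understanding is that these variables are consumed by `init_process_group` when the default `env://` rendezvous method is used. Here is a minimal self-contained sketch of what I mean (the `"nccl"` backend here is just an example choice, not from the docs snippet above):

```python
import os
import torch.distributed as dist

def setup(rank, world_size):
    # Address and port that every process uses to find rank 0.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # With the default init_method="env://", this call reads MASTER_ADDR and
    # MASTER_PORT from the environment to rendezvous all ranks.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
```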
Now I am using Slurm to submit sbatch jobs. A tutorial provided by Princeton University has the following setup:
```bash
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR
```
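If I read the arithmetic right, this takes the last four characters of the job ID, so for a hypothetical `SLURM_JOBID=1234567` it would compute `10000 + 4567` and set `MASTER_PORT=14567`. In other words, the port is not fixed but derived per job, landing somewhere in the 10000-19999 range, and `MASTER_ADDR` is set to the first node in the job's node list rather than `localhost`.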
But it seems that even if I keep MASTER_ADDR as localhost and MASTER_PORT as 12355 and then submit the job to Slurm, it still runs.
Can anyone explain this a little more? For example, where does "12355" come from? Can it be an arbitrary port?