How to set up MASTER_PORT and MASTER_ADDR in Slurm

In PyTorch’s official documentation on DDP, it says to set them as follows:

import os

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

Now I am using Slurm to submit sbatch jobs. In this tutorial provided by Princeton University:

it has the following setup:

export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr

But it seems that even if I just keep master_addr as localhost and master_port as 12355, and then submit the job to Slurm, it can still run.

Can anyone explain this a little bit more? For example, where does “12355” come from? Can it be random?


My understanding is that it just needs to be a free port, so you can just pick a random high port number, since it’s unlikely to collide with anything else;

export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))

just ensures that you’ll generate a (most probably) fresh port number for each new job, since different jobs have different job IDs. As for MASTER_ADDR, localhost works as long as every process runs on the same node; for multi-node jobs the workers need a hostname they can actually reach, which is what the scontrol line above provides.
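To illustrate, here is a small Python sketch (function names are my own, not from the tutorial) that replicates the shell arithmetic above and checks whether a candidate port is actually free:

```python
import socket

def master_port_from_jobid(jobid: str, base: int = 10000) -> int:
    # Replicates: expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4)
    # i.e. a base port plus the last (up to) four digits of the job ID,
    # giving a port in the 10000-19999 range.
    return base + int(jobid[-4:])

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    # A port only works as MASTER_PORT if nothing else is bound to it;
    # trying to bind is a simple way to check.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# e.g. Slurm job ID 48291736 -> 10000 + 1736 = 11736
print(master_port_from_jobid("48291736"))
```

Two jobs whose IDs share the same last four digits would still collide, which is why this is only “most probably” unique; if the derived port happens to be taken, the rendezvous will fail with an address-in-use error and you would need a different port.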