How to set up MASTER_PORT and MASTER_ADDR in slurm

In PyTorch’s official documentation on DDP, it says to set it up as follows:

import os

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

Now I am using Slurm to submit sbatch jobs, and this tutorial provided by Princeton University has the following setup:

export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR=$MASTER_ADDR"
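The shell arithmetic in the first line can be sketched in Python; the job ID below is a made-up example value, not one from the tutorial:

```python
# Replicate: MASTER_PORT = 10000 + (last four characters of SLURM_JOBID).
slurm_jobid = "4837261"  # made-up example job ID

# `echo -n $SLURM_JOBID | tail -c 4` keeps the last four characters.
last_four = slurm_jobid[-4:]          # "7261"

# `expr 10000 + ...` adds 10000, keeping the result above the
# well-known/registered port range.
master_port = 10000 + int(last_four)  # 17261

print(master_port)  # → 17261
```

So the derived port always lands in the 10000–19999 range and varies with the job ID.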

But it seems that even if I keep MASTER_ADDR as localhost and MASTER_PORT as 12355 and then submit the job to Slurm, it still runs.

Can anyone explain this a little more? For example, where does “12355” come from? Can it be random?

Thanks!

My understanding is that it just needs to be a free port, so you can use a random high-numbered port since a collision is unlikely;
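To illustrate the “any free port will do” idea: binding to port 0 asks the OS for an arbitrary unused port, which shows that the specific number has no special meaning (this is my own sketch, not part of the tutorial):

```python
import socket

# Binding to port 0 tells the kernel to pick any currently unused TCP port.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("", 0))
    free_port = s.getsockname()[1]

print(free_port)  # some OS-assigned free port
```

Any such port would work as MASTER_PORT, as long as every process in the job agrees on the same number.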

export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))

just ensures that you’ll (most probably) generate a new port number for each new job.

In your case it works because you are running on a single node. If you ran on multiple nodes, “localhost” would resolve to the loopback address on each node separately, so the nodes would never learn of each other’s existence: once Slurm launches the job from the queue, it executes the same job script on every allocated node. At that point SLURM_JOB_NODELIST contains the (compressed) list of hostnames allocated to the job; scontrol show hostnames expands it, and piping the result to head -n 1 simply picks the first allocated node and sets it as the master.
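A quick way to see why “localhost” cannot work across nodes: it resolves to the loopback interface of whichever machine does the lookup, so every node would point at itself (a minimal illustration):

```python
import socket

# 'localhost' resolves to the loopback address of the *current* machine,
# so two different nodes resolving it get two different, purely local
# endpoints that are unreachable from the other node.
loopback = socket.gethostbyname("localhost")
print(loopback)  # typically 127.0.0.1
```

The tutorial instead resolves the first allocated node’s hostname, which every node in the job resolves to the same reachable address.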

A random port number will be fine unless you are unlucky and some other process or Slurm user happens to be using that same port.
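If you are worried about such a collision, one hedge (my own sketch, not from the tutorial) is to test whether the derived port can be bound and fall back to an OS-assigned one otherwise:

```python
import socket

def usable_port(preferred: int) -> int:
    """Return `preferred` if it is currently free to bind, else an
    OS-assigned free port. Note there is still a small race window
    between this check and the port's actual use."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("", preferred))
            return preferred
        except OSError:
            pass
    # Fall back: bind to port 0 and let the kernel pick an unused port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = usable_port(12355)
print(port)
```

You would then export the result as MASTER_PORT before launching the processes.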