How to specify MASTER_ADDR and worker IDs for RPC?

I am working on a CPU cluster where I request a number of nodes to perform model parallelism, and I must submit jobs and run scripts via slurm. I am unsure how to specify variables such as MASTER_ADDR and MASTER_PORT, and I am not sure whether/how I should use the host names of the nodes as worker IDs (I think I can get these names from the sinfo command).

Also, the worker nodes I'm working with are not connected to the internet, so I'm not sure whether I can even specify a MASTER_ADDR/IP address (I will confirm with the staff).

If anyone has implemented RPC on a cluster with slurm - how did you specify the address/port of the master node and the worker IDs of the other nodes? I surely can't be the first in this situation. I found a previous post that simply recommended asking the service provider, but it would be nice to know if anyone else has handled this issue.


Hey @ajayp1, I am not familiar with slurm, but RPC’s MASTER_ADDR and MASTER_PORT should be similar to the ones used by collective communications and DistributedDataParallel (DDP) in init_process_group. Have you previously used DDP with slurm?

Found an example here: Multi-node-training on slurm with PyTorch · GitHub
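
In case it helps, here is a minimal sketch of the usual slurm setup, assuming one task per node launched via srun (the variable names, port, and gloo backend are assumptions to adjust for your cluster). The same MASTER_ADDR/MASTER_PORT that init_process_group consumes are the ones RPC reads:

import os
import subprocess

import torch.distributed as dist

def setup_from_slurm(port='29500'):
    # Each srun task gets a distinct SLURM_PROCID; use it as the rank.
    rank = int(os.environ['SLURM_PROCID'])
    world_size = int(os.environ['SLURM_NTASKS'])
    # Expand the compressed node list and treat the first host as the master.
    hostnames = subprocess.check_output(
        ['scontrol', 'show', 'hostnames', os.environ['SLURM_JOB_NODELIST']])
    os.environ['MASTER_ADDR'] = hostnames.decode().splitlines()[0]
    os.environ['MASTER_PORT'] = port
    # gloo works on CPU-only clusters; init_process_group picks up the two
    # env vars above via the default env:// init method.
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    return rank, world_size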

Hi @mrshenli,

Thanks for that example; I was looking all over online for something like that but couldn't find one.

If anyone else is curious, I used a similar slurm variable to get the worker IDs and wrote them to a file in my shell script:

# Expand the compressed node list (e.g. node[01-04]) into one hostname per line
scontrol show hostnames $SLURM_JOB_NODELIST > $WORKDIR/mynodes.txt

Then I simply read from this file in my PyTorch script.

To get the rank 0 IP address, I read the rank 0 node name from my written file and resolved it to its first IP (roughly hostname -I | awk '{print $1}' on that node), then assigned the result to os.environ['MASTER_ADDR']; a cleaned-up sketch is below. For my specific cluster the port number didn't matter, so I just picked 8080.
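
A minimal sketch of that step in pure Python, assuming mynodes.txt holds one hostname per line (resolving the name with socket avoids shelling out and capturing output):

import os
import socket

# Assumption: mynodes.txt was written by the shell snippet above, one hostname per line.
with open(os.path.join(os.environ.get('WORKDIR', '.'), 'mynodes.txt')) as f:
    nodes_list = f.read().splitlines()

# The rank 0 node acts as the rendezvous master; resolve its hostname to an IP.
# On many clusters the bare hostname works as MASTER_ADDR too.
os.environ['MASTER_ADDR'] = socket.gethostbyname(nodes_list[0])
os.environ['MASTER_PORT'] = '8080'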

For data parallelism, I'm planning to simply use the distributed library, which requires the same MASTER_ADDR, so I'm not expecting a problem there.


Hey @Kiuk_Chung, is there any recommended way to launch with slurm?

Actually, after reviewing the output file for the RPC implementation, it looks like each node acted as the master, despite the if/else statement to check for rank 0 as per the RPC model parallelism tutorial.

My code is here: GitHub - ajayp1/Distributed-Torch-testing: test scripts for trying out PyTorch's distributed libraries
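
For reference, the structure I am aiming for is roughly the tutorial's rank 0 check; here is a sketch assuming one task per node via srun (so SLURM_PROCID gives every process a distinct rank) and MASTER_ADDR/MASTER_PORT already set as above:

import os

import torch.distributed.rpc as rpc

rank = int(os.environ['SLURM_PROCID'])
world_size = int(os.environ['SLURM_NTASKS'])

if rank == 0:
    # Only rank 0 should reach this branch and drive the model-parallel work.
    rpc.init_rpc('master', rank=rank, world_size=world_size)
    # ... create remote modules / issue rpc.remote() calls to the workers ...
else:
    # Workers just initialize RPC and wait to serve requests from the master.
    rpc.init_rpc(f'worker{rank}', rank=rank, world_size=world_size)

# Blocks until RPC activity finishes on every node.
rpc.shutdown()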

Separate but related issue: when I run the MNIST example from this tutorial, I’m not able to broadcast the MASTER_ADDR and PORT variables to the worker nodes. I managed to at least set the variables on the master node by putting the assignment under if __name__ == "__main__":, but it still doesn’t broadcast to other nodes. Am I supposed to use a certain launcher so that e.g. MPI can communicate between nodes?

My code for this example, using just torch.distributed, is in the same repo; the file names contain MNIST.

@mrshenli unfortunately we don't have specific instructions for launching on SLURM. Having said that, torchelastic is scheduler agnostic: it uses a rendezvous id to "form membership". So if you follow the instructions here: Quickstart — PyTorch/Elastic master documentation, it should work regardless of the scheduler.