I am working with a CPU cluster, where I request a number of nodes to perform model parallelism. I must request jobs and run scripts via Slurm. I am unsure how to specify variables such as MASTER_ADDR and MASTER_PORT, and whether (and how) I should use the host names of the nodes as worker IDs (I think I can get these names from the
Also, the worker nodes I’m working with are not connected to the internet, so I’m unsure whether I can even specify a MASTER_ADDR/IP address. (I will confirm with the staff.)
If anyone has implemented RPC on a cluster with Slurm, how did you specify the address/port of the master node and the worker IDs of the other nodes? I surely can’t be the first in this situation. I found a previous post that simply recommended asking the service provider, but it would be nice to know whether anyone has handled this issue.
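To make the question concrete, here is a rough sketch of the kind of setup I have in mind. It is not a confirmed solution. It assumes the script is launched with srun (one task per process), that `scontrol` is available on the compute nodes, and that the first host in the allocation can act as master. The helper names `first_hostname` and `rpc_settings`, the `worker{rank}` naming, and the port scheme (29500 plus the last digits of the job ID) are all made up for illustration. As far as I understand, MASTER_ADDR only has to be reachable from the other nodes over the cluster's internal network, so a node host name may be enough even without internet access, but I have not verified this on my cluster.

```python
import os
import subprocess


def first_hostname(nodelist):
    """Expand a Slurm nodelist such as 'node[01-04]' and return its first host."""
    out = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()[0]


def rpc_settings(env, resolve=first_hostname, base_port=29500):
    """Derive rendezvous settings from the variables Slurm sets for each task."""
    rank = int(env["SLURM_PROCID"])        # global task index, 0 .. ntasks-1
    world_size = int(env["SLURM_NTASKS"])  # total number of tasks in the job
    master_addr = resolve(env["SLURM_JOB_NODELIST"])
    # Hypothetical port scheme: offset by the job ID so that two jobs
    # sharing a node are less likely to pick the same port.
    master_port = base_port + int(env["SLURM_JOB_ID"]) % 1000
    return {
        "name": f"worker{rank}",  # assumed naming; must be unique per process
        "rank": rank,
        "world_size": world_size,
        "master_addr": master_addr,
        "master_port": master_port,
    }


if __name__ == "__main__" and "SLURM_PROCID" in os.environ:
    settings = rpc_settings(os.environ)
    os.environ["MASTER_ADDR"] = settings["master_addr"]
    os.environ["MASTER_PORT"] = str(settings["master_port"])
    print(settings)
    # With PyTorch installed, the next step would presumably be:
    #   import torch.distributed.rpc as rpc
    #   rpc.init_rpc(settings["name"], rank=settings["rank"],
    #                world_size=settings["world_size"])
```

I would expect to launch this with something like `srun python script.py` inside an sbatch job, so that every task computes the same master address and port but gets its own rank.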