Hi, I have a couple of questions regarding the torch distributed implementation. Please bear with me.
Why do we have the torchrun parameters --rdzv-endpoint and (--master-addr, --master-port) at the same time? I believe the --master-* options are there for backward compatibility with earlier launch utilities. Am I right?
When I run torchrun as
torchrun --nnodes 1 --nproc_per_node 3 --rdzv-endpoint=localhost:4444 --rdzv-id=7 main.py
with --rdzv-endpoint only and print the MASTER_PORT environment variable in my code, I get MASTER_PORT == 4444 as expected. But when I add the option --rdzv-backend=c10d to the previous command, I no longer get MASTER_PORT == 4444 but a random port number (e.g. 33073) that changes on every run. Additionally, when I check the ports opened by the python process by running the command:
netstat -tulpn | grep python
I now see that two ports are open: 4444 and the random port (e.g. 33073).
Why is this so?
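For reference, a minimal sketch of the kind of check I mean, assuming main.py does nothing but read the variables torchrun exports to each worker. The helper name rendezvous_env is hypothetical and this is not my exact script:

```python
import os


def rendezvous_env(environ=os.environ):
    """Collect the variables torchrun exports to each worker process.

    A key that is not set maps to None instead of raising KeyError.
    """
    keys = ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE")
    return {key: environ.get(key) for key in keys}


if __name__ == "__main__":
    # Each worker prints its own view, so the output appears once per process.
    print(rendezvous_env())
```

Launching this with and without --rdzv-backend=c10d is how I compared the MASTER_PORT values.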
What is the “static” rendezvous backend? I see it mentioned by name but couldn’t find an explanation. Even though “static” is the default value for --rdzv-backend, the torchrun examples in the documentation pass --rdzv-backend=c10d whenever they pass --rdzv-backend at all. Furthermore, in the implementation (https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py), passing the --standalone option also sets --rdzv-backend to c10d.
So when is “static” ever used?
In https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py, we see that if the --rdzv-endpoint option is provided, both the --master-addr and --master-port args are discarded.
This supports my idea that the --master-* options are there for backward compatibility, but it raises a follow-up question: when --master-port and --rdzv-endpoint are passed together, why does torchrun open the ports specified by both options?
As I mentioned in my third question, we almost always pass --rdzv-backend=c10d, which makes the code in the same file take the following if statement and return None for both master_addr and master_port:
if rdzv_parameters.backend != "static": return (None, None)
This discards our --rdzv-endpoint values. Isn’t this wrong? I assume it should instead run:
master_addr, master_port = parse_rendezvous_endpoint(endpoint, default_port=-1)
Am I missing something, or is there really an issue here?
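To make the two behaviours concrete, here is a simplified, self-contained sketch. It is not the actual run.py code: split_endpoint is a hypothetical stand-in for parse_rendezvous_endpoint that only handles plain host:port strings (no IPv6), and the two function names are made up for illustration.

```python
def split_endpoint(endpoint, default_port=-1):
    """Hypothetical stand-in for parse_rendezvous_endpoint (plain host:port only)."""
    endpoint = endpoint.strip()
    host, sep, port = endpoint.rpartition(":")
    if not sep:
        # No ":" present, so the whole string is the host.
        return endpoint, default_port
    return host, int(port)


def addr_and_port_as_quoted(backend, endpoint):
    """Behaviour I see: any backend other than "static" yields (None, None)."""
    if backend != "static":
        return (None, None)
    return split_endpoint(endpoint, default_port=-1)


def addr_and_port_as_assumed(endpoint):
    """Behaviour I expected: always derive the values from the endpoint."""
    return split_endpoint(endpoint, default_port=-1)


if __name__ == "__main__":
    print(addr_and_port_as_quoted("c10d", "localhost:4444"))
    print(addr_and_port_as_quoted("static", "localhost:4444"))
    print(addr_and_port_as_assumed("localhost:4444"))
```

With the c10d backend the first function drops the endpoint entirely, which is the behaviour I am asking about.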
Does MASTER_ADDR specify the address that the TCPStore uses to communicate on the master node (rank=0), or the address that the master node uses to communicate with the other processes? If so, what does --rdzv-endpoint specify? I thought that MASTER_ADDR was used for initializing the TCPStore and --rdzv-endpoint was used for handling the communication between the different nodes. But in the implementation I see that MASTER_ADDR is discarded in favor of the --rdzv-endpoint value (see my fourth question), which contradicts my understanding. Could you clarify what these two options are used for?
Thank you very much for your help, I really appreciate it.