Questions about torchrun's --rdzv-* and --master-* options

Hi, I have several questions about the torch.distributed implementation; please bear with me.

  1. Why does torchrun have both --rdzv-endpoint and (--master-addr, --master-port)? I believe the --master-* options exist for backward compatibility with earlier launch utilities. Am I right?

  2. When I run torchrun as

    torchrun --nnodes 1 --nproc_per_node 3  --rdzv-endpoint=localhost:4444 --rdzv-id=7 main.py
    

    with --rdzv-endpoint only and print the MASTER_PORT environment variable in my code, I get MASTER_PORT == 4444, as expected. But when I add --rdzv-backend=c10d to the previous command, I no longer get MASTER_PORT == 4444; instead I get a random port number (e.g. 33073) that differs on each run. Additionally, when I check the ports opened by the python process by running:

    netstat -tulpn | grep python
    

    I now see that two ports are open: 4444 and the random port (e.g. 33073).

    Why is this so?

  3. What is the “static” rendezvous backend? I see it mentioned by name but couldn’t find an explanation. Even though “static” is the default value for --rdzv-backend, the torchrun examples in the documentation pass --rdzv-backend=c10d whenever they pass --rdzv-backend at all. Furthermore, in the implementation (https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py), passing the --standalone option also sets --rdzv-backend to c10d.

    So when is “static” ever used?

  4. In https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py, we see that if the --rdzv-endpoint option is provided, both the --master-addr and --master-port args are discarded.

    This supports my idea that the --master-* options exist for backward compatibility, but it raises another question: when --master-port and --rdzv-endpoint are passed together, why does torchrun open both of the specified ports?

  5. This one concerns the code at https://github.com/pytorch/pytorch/blob/main/torch/distributed/launcher/api.py#L99.

    As I mentioned in question 3, we almost always pass --rdzv-backend=c10d, which makes the code take the following branch and return None for both master_addr and master_port:

    if rdzv_parameters.backend != "static":
        return (None, None)
    

    This discards our --rdzv-endpoint values. Isn’t this wrong? I would expect it to run:

    master_addr, master_port = parse_rendezvous_endpoint(endpoint, default_port=-1)
    

    Am I missing something, or is there really an issue here?

  6. Does MASTER_ADDR specify the address that the TCPStore uses on the master node (rank=0), or the address that the master node uses to communicate with the other processes? If so, what does --rdzv-endpoint specify? I thought that MASTER_ADDR was used for initializing the TCPStore and --rdzv-endpoint was used for handling the communication between the different nodes. But in the implementation I see that MASTER_ADDR is discarded in favor of the --rdzv-endpoint value (see my question 4), which contradicts my understanding. Could you clarify what these two options are used for?
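
In case it helps with reproducing question 2, here is a minimal sketch of a script that prints the variables in question. It is not my exact main.py: the helper name `rendezvous_env` and the `<unset>` placeholder are made up for illustration, and I am assuming the usual variables that torchrun exports to each worker (MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, WORLD_SIZE).

```python
import os

# Variables torchrun is documented to export to each worker process.
TORCHRUN_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE")


def rendezvous_env(environ=os.environ):
    """Return the torchrun-related variables visible in `environ`.

    Hypothetical helper for illustration; missing variables are reported
    as the string "<unset>" instead of raising KeyError.
    """
    return {name: environ.get(name, "<unset>") for name in TORCHRUN_VARS}


if __name__ == "__main__":
    env = rendezvous_env()
    # One line per worker so the output of different ranks is easy to compare.
    print(" ".join(f"{name}={value}" for name, value in env.items()), flush=True)
```

Saved as main.py, it can be launched with the torchrun command from question 2, once with and once without --rdzv-backend=c10d, to compare the MASTER_PORT that each worker reports.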

Thank you very much for your help, I really appreciate it.