Why does torchrun change localhost to host IP for master_addr?


I specify rdzv_endpoint as localhost:29500 in torchrun, but the resulting master_addr is the host's IP address (the resolved hostname), and the port number is also changed.
When I run the same job with distributed.launch, it works as specified, i.e. master_addr is not changed.

In my single-node run, distributed.launch works, but torchrun doesn't. How can I prevent torchrun from doing this?
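For comparison, the distributed.launch invocation that behaves as expected is roughly the following (reconstructed; single node, two processes per node, with train.py as in the log below):

```shell
# Legacy launcher: master_addr/master_port are used verbatim.
python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --master_addr=localhost \
    --master_port=29500 \
    train.py
```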

Below is the log using torchrun:

INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 2
run_id : 1
rdzv_backend : c10d
rdzv_endpoint : localhost:29500
rdzv_configs : {'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /var/tmp/pbs.1649498.scinfra2/torchelastic_y5dqd8_5/1_i8wso1of
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
master_addr=xxx.xxx.xxx.com
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]
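The torchrun command that produced this log is roughly the following (reconstructed from the launch configs printed above; only the flags shown in the log are assumed):

```shell
# Elastic launcher with the c10d rendezvous backend.
# Despite --rdzv_endpoint=localhost:29500, the rendezvous
# reports the resolved host address as master_addr.
torchrun \
    --nnodes=1 \
    --nproc_per_node=2 \
    --rdzv_id=1 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=localhost:29500 \
    --max_restarts=0 \
    train.py
```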

@d4l3k Could you take a look?