Possible cause of the "errno: 98 - Address already in use" error in DDP (with torchrun)

When using torchrun (with DDP), ‘errno: 98 - Address already in use’ errors sometimes occur at random, for example:

[W socket.cpp:436] [c10d] The server socket has failed to bind to [::]:29400 (errno: 98 - Address already in use).
[W socket.cpp:436] [c10d] The server socket has failed to bind to 0.0.0.0:29400 (errno: 98 - Address already in use).
[E socket.cpp:472] [c10d] The server socket has failed to listen on any local network address.

There is a similar issue on GitHub: issue 85604

In issue 85604 and in other answers I found online, the explanations are mostly:

  • “Perhaps the previous pytorch training task was not completely exited, master_port is still being occupied”
  • “(for unknown reasons), port is being occupied”.

And the suggested solution is: “provide a new and free master-port”

While reading the PyTorch source code, I found two places where a TCPStore and the master port are used.

  1. create_c10_backend, where the master port is used to create the first TCPStore.

In my opinion, this TCPStore stores rendezvous information. It is only used to synchronize state between multiple ElasticAgents and is not actually related to DDP.

The ‘Address already in use’ error mentioned above can occur here. I can work around it by finding a free port and specifying --rdzv_endpoint=${host_ip}:${free_port}.
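
A minimal sketch of how a free port for --rdzv_endpoint could be picked. `find_free_port` is a hypothetical helper, not part of PyTorch; it asks the OS for an ephemeral port by binding to port 0. The port is only known to be free at the moment of the check, so another process can still take it before torchrun binds it.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on `host` (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Port 0 tells the kernel to pick any free ephemeral port.
        s.bind((host, 0))
        # The socket is closed when the with-block exits, releasing the port.
        return s.getsockname()[1]


if __name__ == "__main__":
    port = find_free_port()
    # The value to pass to torchrun; 127.0.0.1 stands in for ${host_ip}.
    print(f"--rdzv_endpoint=127.0.0.1:{port}")
```

Because the helper closes the socket before returning, this narrows the chance of a collision but does not remove it.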

After this first TCPStore is created, _initialize_workers finds a new master port with _get_socket_with_port (this port cannot be specified from outside through command-line arguments or environment variables). Later, two keys, MASTER_ADDR and MASTER_PORT (exported as environment variables?), are stored in the first TCPStore; see _set_master_addr_port.

  2. The second TCPStore is created in dist.init_process_group, which eventually calls _env_rendezvous_handler to create the store. In _env_rendezvous_handler:

def _env_rendezvous_handler(
    url: str, timeout: timedelta = default_pg_timeout, **kwargs
):
    ...
    
    # find master_addr and master_port from environment variable
    master_addr = _get_env_or_raise("MASTER_ADDR")
    master_port = int(_get_env_or_raise("MASTER_PORT"))
    ...
    
    # create the store on the master address and port read above
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)

    yield (store, rank, world_size)

So the second TCPStore, the one used by DDP, defaults to the MASTER_PORT that was found earlier through a socket and exported to the environment.

The errno: 98 - Address already in use error can also occur in _create_c10d_store.

I can think of two possible reasons:

  1. The master port found in _set_master_addr_port may not have been fully released yet, perhaps because of TIME_WAIT? I’m not sure, but this could cause the error when the same port is reused in _create_c10d_store.
  2. The master port found in _set_master_addr_port has been fully released, but another process on the system takes it before it is reused in _create_c10d_store, which would also cause the error.
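
The second reason can be reproduced with plain sockets. This is only a sketch of the suspected race, not the actual PyTorch code path: `pick_port_then_release`, `try_bind`, and `simulate_race` are hypothetical names, and the "intruder" socket stands in for an unrelated process that grabs the port in the gap.

```python
import errno
import socket


def pick_port_then_release(host: str = "127.0.0.1") -> int:
    """Mimic 'find a free port, close the socket, publish the number'."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]  # socket is closed when the with-block exits


def try_bind(port: int, host: str = "127.0.0.1") -> int:
    """Return 0 if bind and listen succeed, otherwise the errno of the failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            s.listen()
            return 0
        except OSError as e:
            return e.errno


def simulate_race(host: str = "127.0.0.1") -> int:
    """Let another socket take the port between 'pick' and 'reuse'."""
    port = pick_port_then_release(host)
    # Stand-in for an unrelated process that binds the port in the gap.
    intruder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        intruder.bind((host, port))
        intruder.listen()
        # The later bind on the published port now fails.
        return try_bind(port, host)
    finally:
        intruder.close()


if __name__ == "__main__":
    err = simulate_race()
    print(err, errno.errorcode.get(err))
```

On Linux the returned errno is EADDRINUSE, which is the value 98 seen in the log messages.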

To find out which one applies, I ran a test. I used socket.bind, ss -tlnp | grep <port>, and lsof -i:<port> to check whether the port was occupied, with the socket.bind check placed in _create_c10d_store:


import socket

def is_port_in_use(port, host='127.0.0.1', rank=-1):
    # Try to bind to (host, port); a failure means something already holds it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            in_use = False
        except OSError as e:
            print(f"---> socket error: {e}")
            in_use = True

    if in_use:
        print(f"===> rank {rank}, socket check, port {port} is in use")
    else:
        print(f"===> rank {rank}, socket check, port {port} is in free")
    return in_use

def _create_c10d_store(hostname, port, rank, world_size, timeout, use_libuv=False) -> Store:

    if not 0 <= port < 2**16:
        raise ValueError(f"port must have value from 0 to 65535 but was {port}.")

    if _torchelastic_use_agent_store():
        attempt = os.environ["TORCHELASTIC_RESTART_COUNT"]
        tcp_store = TCPStore(hostname, port, world_size, False, timeout)
        return PrefixStore(f"/worker/attempt_{attempt}", tcp_store)
    else:
        start_daemon = rank == 0
        
        # check whether the port can be bound before creating the store
        is_port_in_use(port, hostname, rank)
        
        return TCPStore(
            hostname, port, world_size, start_daemon, timeout, multi_tenant=True, use_libuv=use_libuv
        )

The test results are as follows:

  • Port free, and the run succeeded:
===> rank 0, socket check, port 39995 is in free
===> rank 1, socket check, port 39995 is in free

# no error, success

  • Port in use, and the run failed:
===> rank 0, socket check, port 48763 is in use
===> rank 1, socket check, port 48763 is in use

# failed
[W socket.cpp:436] [c10d] The server socket has failed to bind to [::]:48763 (errno: 98 - Address already in use).
[W socket.cpp:436] [c10d] The server socket has failed to bind to 0.0.0.0:48763 (errno: 98 - Address already in use).
[E socket.cpp:472] [c10d] The server socket has failed to listen on any local network address.

  • Port in use, but the run succeeded (confusing):
===> rank 0, socket check, port 37956 is in use
===> rank 1, socket check, port 37956 is in use

# no error, success
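
One factor that might contribute to the third case, stated as an assumption rather than a verified answer: the check above binds without SO_REUSEADDR, while a server socket that enables address reuse (whether the c10d server socket does so in this version would need to be confirmed) can bind to a port whose only remaining users are connections in TIME_WAIT. In addition, the check runs in every rank, so a rank that reaches it after rank 0 has started the store daemon would see the port as in use. The sketch below uses a hypothetical helper, `can_bind`, to run the same check with and without SO_REUSEADDR so the two results can be compared.

```python
import socket


def can_bind(port: int, host: str = "127.0.0.1", reuse_addr: bool = False) -> bool:
    """Return True if a TCP socket can bind to (host, port).

    With reuse_addr=True the socket sets SO_REUSEADDR before binding.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        if reuse_addr:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False


def compare_checks(port: int, host: str = "127.0.0.1") -> tuple:
    """Return (strict, relaxed): bind results without and with SO_REUSEADDR."""
    strict = can_bind(port, host, reuse_addr=False)
    relaxed = can_bind(port, host, reuse_addr=True)
    return strict, relaxed


if __name__ == "__main__":
    # 29400 is the port from the log lines at the top of this post.
    strict, relaxed = compare_checks(29400)
    print(f"bind without SO_REUSEADDR: {strict}, with SO_REUSEADDR: {relaxed}")
```

If the two results differ for a port, the plain socket.bind check is stricter than a server that enables address reuse, and "in use" from the check does not necessarily mean such a server will fail.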

I have the following questions:

Question 1:
What is the real cause of the errno: 98 - Address already in use error in _create_c10d_store?

Question 2:
Why does the socket.bind check report the port as in use, while TCPStore on the same port raises no error? (Is something wrong with my test method, or did I overlook something?)

Question 3:
How can this problem be solved or avoided from an engineering perspective, rather than by rerunning after the error appears?