PyTorch Distributed (Gloo) fails with system error: 10049 - The requested address is not valid in its context

Hi everyone,

I am trying to test basic distributed communication between two Windows PCs using torch.distributed with the gloo backend.

Here is the test script:

import sys
import torch.distributed as dist


MASTER_IP = "0.0.0.0"
PORT = "27001"
WORLD_SIZE = 2


def main():
    rank = int(sys.argv[1])

    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://{MASTER_IP}:{PORT}",
        rank=rank,
        world_size=WORLD_SIZE,
    )

    print(f"CONNECTED: rank {rank}")

    dist.barrier()

    print(f"FINISHED: rank {rank}")


if __name__ == "__main__":
    main()

I run the script using:

python test_dist.py 0

But I get the following error:

[W507 15:46:36.000000000 socket.cpp:752] [c10d] The client socket has failed to connect to [xyz.org]:27001 (system error: 10049 - The requested address is not valid in its context.)

A few things I noticed:

  • I set MASTER_IP = "0.0.0.0"

  • Even though I do not specify a hostname, PyTorch tries to connect using my machine's hostname (xyz.org)

  • I am running this on Windows

  • The goal is to run DDP across two separate PCs on the same network

I also tried using my machine's IPv4 address instead of 0.0.0.0, but I still get the same error.
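
One thing that may be relevant to the 0.0.0.0 case: 0.0.0.0 is the unspecified (wildcard) address. It is meant for binding a listening socket to all interfaces, and as far as I know Windows rejects it as a connect() destination with error 10049 (WSAEADDRNOTAVAIL). Below is a minimal sketch of a sanity check I could run on MASTER_IP before calling init_process_group. The helper names is_unspecified_address and check_master_addr are my own (hypothetical, not part of PyTorch), and the check only catches the wildcard case, so it does not explain why my real IPv4 address also fails.

```python
import ipaddress


def is_unspecified_address(addr: str) -> bool:
    """Return True if addr is a wildcard IP literal such as 0.0.0.0 or ::."""
    try:
        return ipaddress.ip_address(addr).is_unspecified
    except ValueError:
        # Not an IP literal (for example a hostname), so not a wildcard.
        return False


def check_master_addr(addr: str) -> str:
    """Hypothetical helper: reject wildcard addresses before building the tcp:// URL."""
    if is_unspecified_address(addr):
        raise ValueError(
            f"{addr} is a wildcard bind address, not a reachable destination; "
            "use the LAN IP of the machine that runs rank 0"
        )
    return addr
```

With this in place, the init_method would be built from check_master_addr(MASTER_IP), so a value like "0.0.0.0" fails early with a readable message instead of a socket warning.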