Hi everyone,
I am trying to test basic distributed communication between two Windows PCs using torch.distributed with the gloo backend.
Here is the test script:
import sys

import torch.distributed as dist

MASTER_IP = "0.0.0.0"
PORT = "27001"
WORLD_SIZE = 2


def main():
    rank = int(sys.argv[1])
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://{MASTER_IP}:{PORT}",
        rank=rank,
        world_size=WORLD_SIZE,
    )
    print(f"CONNECTED: rank {rank}")
    dist.barrier()
    print(f"FINISHED: rank {rank}")


if __name__ == "__main__":
    main()
I run the script using:
python test_dist.py 0
But I get the following error:
[W507 15:46:36.000000000 socket.cpp:752] [c10d] The client socket has failed to connect to [xyz.org]:27001 (system error: 10049 - The requested address is not valid in its context.)
A few things I noticed:
- I set MASTER_IP = "0.0.0.0"
- Even though I do not specify a hostname, PyTorch tries to connect using my machine hostname (xyz.org)
- I am running this on Windows
- Goal is to run DDP across two separate PCs on the same network
I also tried using my IPv4 address instead of 0.0.0.0, but I still get the same error.
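
One thing that may be worth ruling out: 0.0.0.0 is the wildcard address used for binding, and Windows reports error 10049 when it is used as a connect target. Below is a minimal sketch, not a confirmed fix, that rejects such an address before calling init_process_group. The helper name build_init_method, the example address 192.168.1.10, and passing the master IP as a second command-line argument are all assumptions made for illustration. It assumes both PCs are given the LAN IPv4 address of the rank-0 machine.

```python
import ipaddress
import sys


def build_init_method(master_ip: str, port: str) -> str:
    """Build a tcp:// init URL (hypothetical helper, not part of PyTorch).

    Rejects values that cannot be used as a connection target.
    """
    # Raises ValueError if master_ip is not an IP literal.
    addr = ipaddress.ip_address(master_ip)
    if addr.is_unspecified:
        # 0.0.0.0 (or ::) means "any local interface" and is only meaningful for binding.
        raise ValueError(
            f"{master_ip} is a wildcard bind address; use the rank-0 machine's LAN IP"
        )
    if not 0 < int(port) < 65536:
        raise ValueError(f"port {port} is out of range")
    # IPv6 literals need brackets inside a URL.
    host = f"[{addr}]" if addr.version == 6 else str(addr)
    return f"tcp://{host}:{port}"


def main() -> None:
    # Imported here so the helper above can be checked without torch installed.
    import torch.distributed as dist

    rank = int(sys.argv[1])
    master_ip = sys.argv[2]  # assumption: same rank-0 address on both PCs
    dist.init_process_group(
        backend="gloo",
        init_method=build_init_method(master_ip, "27001"),
        rank=rank,
        world_size=2,
    )
    print(f"CONNECTED: rank {rank}")
    dist.barrier()
    print(f"FINISHED: rank {rank}")
    dist.destroy_process_group()


if __name__ == "__main__" and len(sys.argv) == 3:
    main()
```

With this sketch the script would be started as python test_dist.py 0 192.168.1.10 on the rank-0 PC and python test_dist.py 1 192.168.1.10 on the other, where 192.168.1.10 is a placeholder for the rank-0 machine's real address. Since using my IPv4 address also failed, this check alone may not explain the whole problem.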