I'm trying to run single-node, multi-CPU training on Windows. Here is my environment initialization code:
import os
import time
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank, size):
    """ Distributed function to be implemented later. """
    print("rank", rank)
    print("size", size)
    pass


def init_process(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["USE_LIBUV"] = "0"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 2
    processes = []
    mp.set_start_method("spawn")
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
This leads to the following error:
[W907 22:07:29.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to [Yikai-Xu-Office]:12355 (system error: 10049: The requested address is not valid in its context
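Note that the warning shows the client connecting to my machine name (Yikai-Xu-Office) rather than to "localhost", and Winsock error 10049 is WSAEADDRNOTAVAIL. In case it helps with diagnosis, here is a minimal sketch for printing what both names resolve to on this machine. The helper name `resolve_ipv4` is my own and is not part of PyTorch; this only inspects name resolution and is not a confirmed explanation of the failure.

```python
import socket
from typing import List


def resolve_ipv4(host: str) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses that `host` resolves to.

    Hypothetical diagnostic helper; returns an empty list if resolution fails.
    """
    try:
        infos = socket.getaddrinfo(
            host, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Compare the name set in MASTER_ADDR with the machine's own hostname.
    for name in ("localhost", socket.gethostname()):
        print(name, "->", resolve_ipv4(name))
```

If the machine name resolves to an address that is not assigned to any local interface, that would at least be consistent with a 10049 error, but I have not confirmed this is the cause.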
I have already done the usual checks, such as firewall settings and port availability, and they did not help. However, the same code runs successfully on a Mac. Can somebody help me solve this?
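For reference, a port-availability check of the kind mentioned above can be sketched as follows. This is only an illustration under the assumption that binding a TCP socket is a good enough test; the helper name `port_is_free` is hypothetical, and the port number is the one from my MASTER_PORT setting.

```python
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Covers "address already in use" as well as an address
            # that is not valid on this machine.
            return False
    return True


if __name__ == "__main__":
    # 12355 is the MASTER_PORT used in the script above.
    print("127.0.0.1:12355 free?", port_is_free("127.0.0.1", 12355))
```

A `True` result only says the port could be bound at that moment; it says nothing about whether another process can connect to it by hostname.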
Note: I must stay on Windows and use single-machine, multi-CPU mode for specific reasons. gloo therefore seems to be the only choice for me: NCCL does not support CPU, and MPI requires rebuilding PyTorch from source, which is too complicated for me.
Thanks in advance!