I am running a script for distributed processing on Windows with 1 GPU. For that, I am using torch.distributed. However, I am confused about how to set the dist_url flag. Among many others, I went through this documentation but couldn't figure it out.
Whether I use args.dist_url = 'tcp://localhost:58472' or args.dist_url = 'env://', or even manually determine my IP address and a free port and set args.dist_url = "tcp://{}:{}".format(ip, port), I get the following error:
raise RuntimeError("No rendezvous handler for {}://".format(result.scheme)) RuntimeError: No rendezvous handler for tcp://
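For reference, I determine the IP and free port roughly like this (simplified; just standard socket calls):

```python
import socket

# Pick this machine's IP and an OS-assigned free port.
ip = socket.gethostbyname(socket.gethostname())
with socket.socket() as s:
    s.bind(('', 0))            # port 0 = let the OS choose a free port
    port = s.getsockname()[1]
dist_url = 'tcp://{}:{}'.format(ip, port)
```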
Alternatively, I tried to set args.dist_url = "file:///E:/tmp.txt" but then I get the following error:
raise RuntimeError("Distributed package doesn’t have NCCL " RuntimeError: Distributed package doesn’t have NCCL built in
All these errors are raised when the init_process_group() function is called as follows: torch.distributed.init_process_group(backend='nccl', init_method=args.dist_url, world_size=args.world_size, rank=args.rank)
Here, note that args.world_size = 1 and args.rank = 0. Any help on this would be appreciated, especially on how to set up the tcp init method.
Below is a simple script that I tried: it works on Ubuntu, but not on Windows. On Windows it raises the same error (RuntimeError: No rendezvous handler for tcp://) when init_process_group() is called.
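Stripped down to the relevant parts, the script looks roughly like this (the port number is arbitrary):

```python
import torch.distributed as dist

def main():
    # Single machine, single GPU: world_size = 1, rank = 0.
    dist.init_process_group(
        backend='nccl',                       # works on Ubuntu, fails on Windows
        init_method='tcp://localhost:58472',  # also tried 'env://' and 'file:///E:/tmp.txt'
        world_size=1,
        rank=0,
    )
    print('initialized:', dist.is_initialized())
    dist.destroy_process_group()

if __name__ == '__main__':
    main()
```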
Hey @mbehzad, I noticed you are using NCCL. IIUC, NCCL is not available on Windows. Could you please try gloo instead?
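That is, keep everything else the same and only change the backend string in your call (sketch based on your snippet above):

```python
torch.distributed.init_process_group(backend='gloo',  # instead of 'nccl'
                                     init_method=args.dist_url,
                                     world_size=args.world_size,
                                     rank=args.rank)
```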
Another question: which PyTorch version are you using? In v1.7.*, the distributed package only supports FileStore rendezvous on Windows; TCPStore rendezvous was added in v1.8.
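So on Windows with v1.7.*, you would combine gloo with the file:// method you already tried (reusing your path here; delete the file between runs so the rendezvous starts fresh):

```python
import torch.distributed as dist

# Windows + PyTorch 1.7.*: gloo backend with FileStore rendezvous.
# The file is created automatically if it does not exist.
dist.init_process_group(
    backend='gloo',
    init_method='file:///E:/tmp.txt',
    world_size=1,
    rank=0,
)
```

On v1.8+, the tcp://localhost:58472 method from your first attempt should also work, again with backend='gloo'.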