PyTorch's `dist_url` init method in Distributed Processing

Hi,

I am running a script for distributed processing on Windows with 1 GPU, using torch.distributed. However, I am confused about how to set the dist_url flag. Among many other resources, I went through the documentation but couldn't figure it out.

When I use args.dist_url = 'tcp://localhost:58472' or args.dist_url = 'env://', or even when I manually determine my IP address and a free port and set args.dist_url = "tcp://{}:{}".format(ip, port), I get the following error:

raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for tcp://
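
For context, the manual IP/port lookup I mentioned is roughly the following (a minimal sketch using the standard socket module; the exact helper in my script may differ slightly):

import socket

# Resolve this machine's IP address from its hostname
ip = socket.gethostbyname(socket.gethostname())

# Ask the OS for a free port by binding to port 0 and reading back the assigned port
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("", 0))
    port = s.getsockname()[1]

dist_url = "tcp://{}:{}".format(ip, port)
print(dist_url)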

Alternatively, I tried setting args.dist_url = "file:///E:/tmp.txt", but then I get the following error:

raise RuntimeError("Distributed package doesn’t have NCCL "
RuntimeError: Distributed package doesn’t have NCCL built in

All these errors are raised when the init_process_group() function is called as follows:
torch.distributed.init_process_group(backend='nccl', init_method=args.dist_url, world_size=args.world_size, rank=args.rank)

Here, note that args.world_size=1 and args.rank=0. Any help on this would be appreciated, especially on how to set up the tcp:// init method.

Try 127.0.0.1 instead of localhost (e.g. tcp://127.0.0.1:23456).

I tried that too, but it didn't work.

Following is a simple snippet that works on Ubuntu but not on Windows. On Windows, it raises the same error (RuntimeError: No rendezvous handler for tcp://) when init_process_group() is called.

import torch.distributed as dist

dist_url = 'tcp://127.0.0.1:23456'
dist.init_process_group(
    backend='nccl',
    rank=0,
    init_method=dist_url,
    world_size=1)
print("Done")

@mrshenli Do you have any idea why this TCP init method does not work on Windows?

Hey @mbehzad, I noticed you are using NCCL. IIUC, NCCL is not available on Windows. Could you please try gloo instead?

Another question: which PyTorch version are you using? In v1.7.*, the distributed package only supports FileStore rendezvous on Windows; TCPStore rendezvous was added in v1.8.
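
If it helps, here is a rough sketch of picking an init method based on the installed version (just an illustration for a single local process on Windows; the path and port are placeholders):

import torch
import torch.distributed as dist

# Parse the major/minor version, e.g. '1.8.1+cpu' -> (1, 8)
major, minor = (int(x) for x in torch.__version__.split(".")[:2])

if (major, minor) >= (1, 8):
    init_method = "tcp://127.0.0.1:23456"      # TCPStore rendezvous, available from v1.8
else:
    init_method = "file:///e:/tmp/some_file"   # FileStore rendezvous, the only option in v1.7.*

dist.init_process_group(
    backend="gloo",            # NCCL is not built on Windows
    init_method=init_method,
    rank=0,
    world_size=1)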


Hi @mrshenli, you are right indeed. I updated to PyTorch 1.8.1, used the gloo backend with a file:// init_method, and it works.

For anyone wondering, below is a sample that works on Windows. That said, I do not recommend distributed processing on Windows.

import torch.distributed as dist

dist.init_process_group(
    backend='gloo',
    rank=0,
    init_method='file:///e:/tmp/some_file',
    world_size=1)

print("Done")