Code gets stuck at dist.init_process_group with 2 machines

Omer_Cohen1 · February 18, 2022, 12:37pm

Hey,

I am trying to run simple code on 2 machines (both Windows 10).

When I run main.py twice on the first machine,
the code runs fine (2 processes, 1 on each GPU, 2 GPUs total).
I have checked that ranks are correct.
Commands:
python main.py --start_rank=0
python main.py --start_rank=1

world_size=2
Backend is ‘gloo’
MASTER_ADDR is 127.0.0.1
MASTER_PORT is a free port.

When I run main.py once on each machine, the code hangs forever (on both machines) at this line:

  dist.init_process_group(
      backend=args.backend,
      init_method=args.init_method,
      world_size=args.world_size,
      rank=args.rank,
      timeout=timedelta(seconds=30)
  )

world_size=2
Backend is ‘gloo’.
MASTER_ADDR is the ip of the first machine (machine with rank 0)
MASTER_PORT is a free port.
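
For reference, the two-machine settings listed above might be wired up roughly as follows. This is a minimal sketch, not the original main.py: the helper names `make_init_method` and `init_distributed`, the example address `192.168.1.10`, and the port `29500` are assumptions for illustration only. `torch` is imported lazily so the URL helper can be checked without it.

```python
from datetime import timedelta


def make_init_method(master_addr: str, master_port: int) -> str:
    """Build a tcp:// init_method URL pointing at the rank-0 machine."""
    if not 0 < master_port < 65536:
        raise ValueError(f"invalid port: {master_port}")
    return f"tcp://{master_addr}:{master_port}"


def init_distributed(master_addr, master_port, rank, world_size,
                     backend="gloo", timeout_s=30):
    """Call init_process_group with identical arguments on every machine,
    changing only `rank` (0 on the master machine, 1 on the second)."""
    # Imported here so make_init_method has no torch dependency.
    import torch.distributed as dist

    dist.init_process_group(
        backend=backend,
        init_method=make_init_method(master_addr, master_port),
        world_size=world_size,
        rank=rank,
        timeout=timedelta(seconds=timeout_s),
    )


# Hypothetical usage: the same address and port on both machines.
# Machine 0: init_distributed("192.168.1.10", 29500, rank=0, world_size=2)
# Machine 1: init_distributed("192.168.1.10", 29500, rank=1, world_size=2)
```

Both machines must pass the address of the rank-0 machine; only the rank differs between them.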

I was able to ping between the machines,
and created anti-virus rules for connections between them.

I think it might be a connection issue (because it works on the first machine), but I am not sure (because I am not getting a timeout after 30 seconds).

Any ideas?

cbalioglu · February 22, 2022, 1:31pm

Which version of PyTorch are you using? In our nightly builds (and in the upcoming v1.11 release) we output warning logs to help troubleshoot connection issues during process initialization. Also make sure that your port range is open externally. Ping might work, but a TCP connection can still fail in that case.
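
Since ping only exercises ICMP, a plain TCP connection attempt is a closer test of whether the master port is reachable. The following is a rough sketch using only the Python standard library; the function name `can_connect` and the default timeout are assumptions, not part of PyTorch.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Something has to be listening on the port for this to succeed, for example the rank-0 process while it waits inside init_process_group. Running `can_connect(MASTER_ADDR, MASTER_PORT)` on the second machine then tells you whether TCP gets through. A `False` result can mean either that nothing is listening or that a firewall is blocking the port, so it narrows the problem rather than pinpointing it.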

Omer_Cohen1 · February 22, 2022, 1:42pm

Thanks.

I am using 1.10.2

I have opened the connection for all ports between the machines.
I can see that the anti-virus accepts them all
(some of them over TCP and some over UDP).

When I look at the logs, I can see that connections are made on multiple random ports and not just on the master port. I guess this is how gloo works?

Would you recommend that I move to a nightly build? Any other ideas?

cbalioglu · February 22, 2022, 2:06pm

If you don’t see any connection failures in the log files, this might also be an issue in your training script. For instance, if you don’t call your collective operations in the same order on all your ranks, they will block waiting for the others.

Omer_Cohen1 · February 22, 2022, 2:17pm

The code hangs forever at init_process_group; there are no collective operations before this line.

Yanli_Zhao · February 23, 2022, 10:29pm

Try to export GLOO_SOCKET_IFNAME?

Omer_Cohen1 · February 24, 2022, 4:47am

Yes, I set os.environ['GLOO_SOCKET_IFNAME'].

On first machine it is ‘Wi-Fi’
On second machine it is ‘WiFi’

Didn’t help…

HuYang719 · February 24, 2022, 11:36am

Hi Omer, you may need to use ip addr or ifconfig to check which network interface you need and export it in GLOO_SOCKET_IFNAME. Are Wi-Fi/WiFi the names you get from these commands?

Omer_Cohen1 · February 25, 2022, 12:49pm

I was able to get the interface name from this command:
‘netsh interface show interface’

Then I got this (example on the first machine): [screenshot of the command output]

When I tried to use the interface names from ipconfig /all, I got this exception:
[enforce fail at …\third_party\gloo\gloo\transport\uv\device.cc:164] false. Unable to find address for: ‘the IFNAME that I tried’
I tried all the interface names…

Am I doing it correctly?

HuYang719 · February 25, 2022, 2:25pm

Hi, I suggested that you use ip addr. You may see something like lo / eth0 / ..., and you can choose one of them as the interface.

Omer_Cohen1 · February 26, 2022, 6:26pm

Hey, I think that ip addr is for Linux, and I am working with Windows.
I have tried all the names from ipconfig /all and none of them worked.
The only one that didn’t give an error was ‘Wi-Fi’
(I took it from the ‘netsh interface show interface’ command; see the image in my previous message).

Can you think of something else? Why am I not getting a timeout? It just hangs forever…
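
While the cause of the missing timeout is unclear, one way to at least turn the silent hang into a visible failure is to wrap the blocking call in a watchdog. The sketch below uses only the standard library; `run_with_deadline` is a hypothetical helper, not a PyTorch API, and it does not cancel the stuck call, it only reports that the deadline passed.

```python
import threading


def run_with_deadline(fn, deadline_s: float):
    """Run fn() in a daemon thread; raise TimeoutError if it is still
    running after deadline_s seconds. The worker thread is not killed."""
    result = {}

    def target():
        try:
            result["value"] = fn()
        except BaseException as exc:  # forward any failure to the caller
            result["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(deadline_s)
    if worker.is_alive():
        raise TimeoutError(f"call did not finish within {deadline_s} s")
    if "error" in result:
        raise result["error"]
    return result.get("value")
```

Wrapping the init_process_group call in a function and passing it to `run_with_deadline` with a deadline of, say, 60 seconds would make each machine fail loudly instead of waiting indefinitely. Because the worker thread keeps running in the background, the process should exit after the TimeoutError rather than retry in place.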