I am trying to run simple code on 2 machines (both Windows 10).
When I run main.py twice on the first machine,
the code runs fine (2 processes, one on each GPU, 2 GPUs total).
I have checked that ranks are correct.
Commands:
python main.py --start_rank=0
python main.py --start_rank=1
world_size=2
Backend is ‘gloo’
MASTER_ADDR is 127.0.0.1
MASTER_PORT is a free port.
When I run main.py once on each machine, the code hangs forever (on both machines) at this line:
Which version of PyTorch are you using? In our nightly builds (and in the upcoming v1.11 release) we are outputting warning logs to troubleshoot connection issues during process initialization. Also make sure that your port range is open externally. Ping might work, but connection can still fail in such case.
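To rule out the "ping works but the connection fails" case, here is a minimal sketch of a TCP reachability check you can run from the second machine. It only uses the Python standard library; the helper name can_connect and the address and port in the usage comment are placeholders, not values from your setup. Note that something must already be listening on the port (for example, rank 0 already started), otherwise the check fails even when the firewall is open.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout_s."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Usage (hypothetical values, replace with rank 0's LAN address and your MASTER_PORT):
#   print(can_connect("192.168.1.10", 29500))
```

If this returns False while ping succeeds, the problem is most likely a firewall or anti-virus rule rather than the training script.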
I have opened the connection for all ports between the machines.
I can see that the anti-virus accepts them all.
(Some of them over TCP and some over UDP.)
When I look at the logs, I can see that the connections are made on multiple random ports and not just on the master port. I guess this is how gloo works?
Would you recommend that I move to a nightly build? Any other ideas?
If you don’t see any connection failures in the log files, this might also be an issue in your training script. For instance if you don’t call your collective operations in the same order in all your ranks, they will block waiting others.
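One way to check for this offline, as a rough sketch: have each rank append the name of every collective it calls to a list (or a log file), then compare the sequences afterwards. The helper below is purely illustrative and not part of PyTorch; first_divergence and the call names in the usage comment are assumptions made for the example.

```python
from itertools import zip_longest
from typing import Optional, Sequence, Tuple


def first_divergence(
    calls_a: Sequence[str], calls_b: Sequence[str]
) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, call_a, call_b) for the first position where two ranks'
    recorded collective calls differ, or None if the sequences are identical.
    A missing call (one rank made fewer calls) is reported as None."""
    for i, (a, b) in enumerate(zip_longest(calls_a, calls_b)):
        if a != b:
            return i, a, b
    return None


# Usage (hypothetical logs recorded by rank 0 and rank 1):
#   first_divergence(["broadcast", "all_reduce"], ["all_reduce", "broadcast"])
```

Any result other than None points at the first call where one rank would block waiting for the other.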
Hi Omer, you may need to use ip addr or ifconfig to find the network interface that you need and export its name in GLOO_SOCKET_IFNAME. Is Wi-Fi/WiFi the interface name you get from these commands?
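As a minimal sketch of what that looks like in the script: set the environment variables before creating the process group, and pass an explicit timeout. The helper names set_rendezvous_env and init_gloo and the values in the usage comment are made up for illustration. I am assuming the env:// init method here, in which case MASTER_ADDR has to be an address of the rank 0 machine that the other machine can reach (not 127.0.0.1). Whether the timeout actually interrupts the hang you are seeing is something you would need to test.

```python
import datetime
import os
from typing import Optional


def set_rendezvous_env(master_addr: str, master_port: int, ifname: Optional[str] = None) -> None:
    """Set the variables read by the env:// init method (and by gloo for the interface)."""
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    if ifname:
        # Interface name as reported by the OS on *this* machine.
        os.environ["GLOO_SOCKET_IFNAME"] = ifname


def init_gloo(rank: int, world_size: int, timeout_s: int = 120) -> None:
    """Create the default process group with the gloo backend and an explicit timeout."""
    # Imported lazily so the env helper above can be used without torch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend="gloo",
        init_method="env://",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(seconds=timeout_s),
    )


# Usage on rank 0 (hypothetical address, port, and interface name):
#   set_rendezvous_env("192.168.1.10", 29500, ifname="Wi-Fi")
#   init_gloo(rank=0, world_size=2)
```

The interface name can differ between the two machines, so set it separately on each one.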
I was able to get the interface name from this command:
‘netsh interface show interface’
Then I got this (example from the first machine):
When I tried to use the interface names from ipconfig /all, I got this exception:
[enforce fail at …\third_party\gloo\gloo\transport\uv\device.cc:164] false. Unable to find address for: ‘the IFNAME that I tried’
I tried all the interface names…
Hey, I think that ip addr is for Linux, and I am working on Windows.
I have tried all the names from ipconfig /all and none of them worked.
The only one that didn’t give an error was ‘Wi-FI’
(I took it from ‘netsh interface show interface’ command - image in previous message).
Can you think of anything else? Why am I not getting a timeout? It just hangs forever…