Strange behaviour of GLOO tcp transport

Hi @mrshenli,

Thank you very much for your answers before. I recently encountered another problem with the Gloo backend. One of my servers has 2 network interfaces: eno2 (10.1.3.6) and enp94s0f1 (10.1.3.2), and both of them can talk to a remote master node at 10.1.3.1, using

ping -I 10.1.3.2 10.1.3.1
or
ping -I 10.1.3.6 10.1.3.1

Then in my PyTorch code, I want to use eno2 for the process group on this slave node, so I ran in the terminal
export GLOO_SOCKET_IFNAME=eno2
before launching the python code that executes:
dist.init_process_group(
    backend='gloo',
    init_method='tcp://10.1.3.1:12345',
    world_size=2,
    rank=1,
)
However, it turned out that the slave node was actually using enp94s0f1 (10.1.3.2) instead of eno2 as I wanted.

If I bring down enp94s0f1 and leave only eno2 up, init_process_group does use eno2.

Could you help me solve this issue? My ultimate goal is to specify one network interface to be used in one process and a different network interface to be used in another process.
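For reference, this is roughly what I am aiming for: each process pins its own interface by setting GLOO_SOCKET_IFNAME in its own environment before the process group is created (a sketch; the interface names are just the ones from my machine):

import os
import torch.distributed as dist

# Each process sets its own interface before creating the process group,
# since Gloo reads GLOO_SOCKET_IFNAME at initialization time.
os.environ["GLOO_SOCKET_IFNAME"] = "eno2"   # the other process would use "enp94s0f1"

dist.init_process_group(
    backend="gloo",
    init_method="tcp://10.1.3.1:12345",
    world_size=2,
    rank=1,
)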

Thank you very much!

@mrshenli I am facing a similar issue. When I run on one machine or within one cloud platform like Azure, init_rpc runs fine, i.e. when all nodes are on the same subnet. But if I run the server (rank0) on one cloud platform and rank1 on a different cloud platform, it throws "RuntimeError: Gloo connectFullMesh failed with Connection reset by peer". I am able to ping the server from the worker and vice versa. I even tried to tunnel both connections through a VPN server, but I get the same error. How do I solve this?

rank0

import torch
import torch.distributed.rpc as rpc
import os

os.environ['MASTER_ADDR'] = '100.8.0.5'
os.environ['MASTER_PORT'] = '3332'

# Rank 0: create the RPC agent, run a remote add on worker1, then shut down.
rpc.init_rpc("worker0", rank=0, world_size=2)
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 3))
rpc.shutdown()

#############rank1##########

import os
print(os.environ.get('GLOO_SOCKET_IFNAME'))
os.environ['MASTER_ADDR'] = '100.8.0.5'
os.environ['MASTER_PORT'] = '3332'

# os.environ['GLOO_SOCKET_IFNAME'] = 'nonexist'
# print(os.environ.get('GLOO_SOCKET_IFNAME'))

import torch.distributed.rpc as rpc

# Rank 1: join the RPC group and wait for work from rank 0.
rpc.init_rpc("worker1", rank=1, world_size=2)
rpc.shutdown()

********error

[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
Traceback (most recent call last):
  File "/home/ubuntu/h_test.py", line 13, in <module>
    rpc.init_rpc("worker1", rank=1, world_size=2)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 200, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 233, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 104, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 324, in _tensorpipe_init_backend_handler
    group = _init_process_group(store, rank, world_size)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 112, in _init_process_group
    group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error


Solved this by adding os.environ["TP_SOCKET_IFNAME"] = "tun0" and os.environ["GLOO_SOCKET_IFNAME"] = "tun0" right before calling init_rpc. I was also tunnelling the communication through a VPN.
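For anyone else hitting this, a minimal sketch of my rank1 script with those two lines in place (tun0 is my VPN tunnel interface; replace it with whatever interface your VPN creates):

import os

# Point both the TensorPipe agent and the Gloo process group at the VPN tunnel.
# These must be set before init_rpc is called.
os.environ["TP_SOCKET_IFNAME"] = "tun0"
os.environ["GLOO_SOCKET_IFNAME"] = "tun0"
os.environ["MASTER_ADDR"] = "100.8.0.5"   # rank0's address on the VPN
os.environ["MASTER_PORT"] = "3332"

import torch.distributed.rpc as rpc

rpc.init_rpc("worker1", rank=1, world_size=2)
rpc.shutdown()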


I’m facing the same issue, but it doesn’t go away even if I set GLOO_SOCKET_IFNAME and TP_SOCKET_IFNAME.

Can you be more specific so I can help? Knowing the problem, the structure of your distributed code, and the exact error will let me help you.

@selineni I’m also facing the same issue. The exact problem is described here by @ericauld.

@jimouris if you are still facing the problem, please describe your system setup (how many instances, connection type, etc.) so I can provide my perspective.

@selineni Thanks for your quick response!

I’m using two c5n.9xlarge EC2 instances running Ubuntu 20.04.6 LTS. In the EC2 security settings I’ve opened ports 20000-30000 on both of them.

Both instances use the ens5 interface (from ifconfig). Some settings below:

  • export WORLD_SIZE=2; export RENDEZVOUS=env://; export MASTER_ADDR=172.X.Y.Z; export MASTER_PORT=29500; export RANK=0;
  • export WORLD_SIZE=2; export RENDEZVOUS=env://; export MASTER_ADDR=172.X.Y.Z; export MASTER_PORT=29500; export RANK=1;

Also, I tried with:

  • export WORLD_SIZE=2; export RENDEZVOUS=env://; export MASTER_ADDR=172.X.Y.Z; export MASTER_PORT=29500; export GLOO_SOCKET_IFNAME=ens5; export TP_SOCKET_IFNAME=ens5; export RANK=0;
  • export WORLD_SIZE=2; export RENDEZVOUS=env://; export MASTER_ADDR=172.X.Y.Z; export MASTER_PORT=29500; export GLOO_SOCKET_IFNAME=ens5; export TP_SOCKET_IFNAME=ens5; export RANK=1;

In both cases it’s hanging for a few minutes and then I’m getting:
RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error

Let me know what other details I can provide. Thanks for the help!
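The initialization on each instance boils down to roughly this (a sketch, not the full script; everything comes from the exported variables above, so only RANK differs between the two machines):

import os
import torch.distributed as dist

# env:// rendezvous reads MASTER_ADDR and MASTER_PORT from the environment;
# WORLD_SIZE and RANK are read explicitly here.
dist.init_process_group(
    backend="gloo",
    init_method=os.environ.get("RENDEZVOUS", "env://"),
    world_size=int(os.environ["WORLD_SIZE"]),
    rank=int(os.environ["RANK"]),
)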

@jimouris, I had a similar setup but required encryption tunnels for communication. Since this is a peer-to-peer (P2P) setup, I found that each instance always tries to reach every other instance (I verified this in Wireshark from tcpdump captures). When they could not find each other (subnet differences can cause this problem), I received that error. Tunneling these instances through a VPN helped me: by doing that, all machines appear to be on the same subnet and are always reachable. If your project architecture is okay with using a VPN such as Tailscale or WireGuard, then you will have no problems.

If you are using a VPN, set these before calling init_rpc, and check what the tunnel interface is called in ifconfig; mine was tun0 when using the VPN:
os.environ["TP_SOCKET_IFNAME"] = "tun0"
os.environ["GLOO_SOCKET_IFNAME"] = "tun0"

Also, I am assuming that you are using RPC for model parallelism or a similar architecture?

@selineni thanks for your response! Looking into it more, the two instances have:

  • completely different public IPv4 addresses,
  • similar private IPv4 addresses: X.Y.0.24 vs X.Y.9.57, and
  • the same Subnet ID (Details tab in the AWS console after you click on the instance name) and the same VPC ID (not sure if this matters though).

Running nc as described here, I was able to talk to rank0 from the rank1 instance. Is a VPN the only alternative?

Is there any case where rank0 needs to talk to rank1? Maybe a port is not open?
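For the record, the check I ran from the rank1 instance boils down to this (a Python sketch equivalent to the nc test; the address and port are placeholders for the master's private IP and port):

import socket

# Placeholders: rank0's private IP (redacted above) and the master port.
MASTER_ADDR = "172.X.Y.Z"
MASTER_PORT = 29500

# If this connects, plain TCP to the master port works from rank1 to rank0.
with socket.create_connection((MASTER_ADDR, MASTER_PORT), timeout=5) as s:
    print("connected to", s.getpeername())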

@jimouris that is strange; if it is on the same subnet it should work. What I observed is that rank0 first communicates with rank1 on the master port, and once the rendezvous completes, it opens random ephemeral ports around ~60000 for minor pings. As an experiment, can you allow all ports between those 2 machines, then take a tcpdump and check the log with Wireshark or similar? I suspect you have very restrictive rules. Once you know the range of ports they are communicating on, you can narrow the rules back down to that range.

I had opened ports in the range 20000-30000; you were right about the ~60000 range ports! Although both instances were on the same subnet, an explicit inbound rule had to be specified for the subnet and those ports.

Thanks for the help!