Strange behaviour of GLOO tcp transport

Hi @mrshenli,

Thank you very much for your answers before. I recently encountered another problem with the Gloo backend. On one of my servers I have two network interfaces: eno2 (10.1.3.6) and enp94s0f1 (10.1.3.2), and both of them can reach a remote master node at 10.1.3.1, using

ping -I 10.1.3.2 10.1.3.1
or
ping -I 10.1.3.6 10.1.3.1

Then in my PyTorch code I want the process group on this slave node to use eno2, so in the terminal I ran
export GLOO_SOCKET_IFNAME=eno2
before launching the Python code that executes:
dist.init_process_group(
    backend='gloo',
    init_method='tcp://10.1.3.1:12345',
    world_size=2,
    rank=1,
)
However, it turned out that the slave node was actually using enp94s0f1 (10.1.3.2) instead of eno2 as I wanted.

If I bring enp94s0f1 down and only eno2 is left, init_process_group does use eno2.

Could you help me solve this issue? My ultimate goal is to make one process use one network interface and another process use a different network interface.
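
For example, this is roughly what I have in mind, as a minimal sketch (using the interface names and master address from above), with each process setting its own interface right before creating the process group:

# Minimal sketch: pin this process to a specific NIC by setting
# GLOO_SOCKET_IFNAME inside the process, before init_process_group.
# The other process would set 'enp94s0f1' (or its own interface) instead.
import os
import torch.distributed as dist

os.environ['GLOO_SOCKET_IFNAME'] = 'eno2'

dist.init_process_group(
    backend='gloo',
    init_method='tcp://10.1.3.1:12345',
    world_size=2,
    rank=1,
)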

Thank you very much!

@mrshenli I am facing a similar issue. When I run on one machine or within a single cloud platform such as Azure, init_rpc runs fine, i.e. when all nodes are on the same subnet. But if I run the server (rank 0) on one cloud platform and rank 1 on a different cloud platform, it throws "RuntimeError: Gloo connectFullMesh failed with Connection reset by peer". I am able to ping the server from the worker and vice versa. I even tried tunnelling both connections through a VPN server, but got the same error. How do I solve this?

rank0

import torch
import torch.distributed.rpc as rpc
import os

os.environ['MASTER_ADDR'] = '100.8.0.5'
os.environ['MASTER_PORT'] = '3332'
rpc.init_rpc("worker0", rank=0, world_size=2)
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 3))
rpc.shutdown()

#############rank1##########

import os
print(os.environ.get('GLOO_SOCKET_IFNAME'))
os.environ['MASTER_ADDR'] = '100.8.0.5'
os.environ['MASTER_PORT'] = '3332'

# os.environ['GLOO_SOCKET_IFNAME'] = 'nonexist'
# print(os.environ.get('GLOO_SOCKET_IFNAME'))

import torch.distributed.rpc as rpc
rpc.init_rpc("worker1", rank=1, world_size=2)
rpc.shutdown()

#############error##########

[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with […/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
Traceback (most recent call last):
  File "/home/ubuntu/h_test.py", line 13, in <module>
    rpc.init_rpc("worker1", rank=1, world_size=2)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 200, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/__init__.py", line 233, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 104, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 324, in _tensorpipe_init_backend_handler
    group = _init_process_group(store, rank, world_size)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 112, in _init_process_group
    group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Gloo connectFullMesh failed with […/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error


Solved this by adding os.environ["TP_SOCKET_IFNAME"] = "tun0" and os.environ["GLOO_SOCKET_IFNAME"] = "tun0" before calling init_rpc. I was also tunnelling the communication through a VPN.
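
For anyone hitting the same thing, here is a minimal sketch of that fix for the rank 1 worker (assuming tun0 is the VPN interface on both machines, and the same MASTER_ADDR/MASTER_PORT as above):

# Minimal sketch: force both the Gloo process group and the TensorPipe
# agent onto the VPN interface by setting the env vars before init_rpc.
import os

os.environ['MASTER_ADDR'] = '100.8.0.5'
os.environ['MASTER_PORT'] = '3332'
os.environ['GLOO_SOCKET_IFNAME'] = 'tun0'
os.environ['TP_SOCKET_IFNAME'] = 'tun0'

import torch.distributed.rpc as rpc
rpc.init_rpc("worker1", rank=1, world_size=2)
rpc.shutdown()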


I’m facing the same issue, but it doesn’t go away even if I set GLOO_SOCKET_IFNAME and TP_SOCKET_IFNAME.

Can you be more specific so I can help? Knowing the problem, the structure of your distributed code, and the exact error will let me help you.