Strange behaviour of GLOO tcp transport

Hi @mrshenli,

Thank you very much for your earlier answers. I recently encountered another problem with the Gloo backend. One of my servers has 2 network interfaces: eno2 ( and enp94s0f1 (, and both of them can talk to a remote master node @, using

ping -I
ping -I

Then, in my PyTorch code, I want to use eno2 for my process group on this slave node, so I did in terminal
before launching the Python code that executes:
    backend='gloo',
    init_method='tcp://',
    world_size=2,
    rank=1,
However, it turned out that the slave node was actually using enp94s0f1 ( instead of eno2 as I wanted.

If I bring enp94s0f1 down and leave only eno2 up, init_process_group uses eno2.

Could you help to solve this issue? My ultimate goal is that I want to specify a network interface to be used in a process and specify another network interface to be used in another process.
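
One way to give each process its own interface is to scope GLOO_SOCKET_IFNAME to that process's environment rather than exporting it for the whole shell. The following is a minimal sketch under that assumption: launch_with_interface is a hypothetical helper (not a PyTorch API), and the child here only echoes the variable instead of calling init_process_group. It shows how to set the variable per process; it does not by itself explain why Gloo picked enp94s0f1 in the setup above.

```python
import os
import subprocess
import sys


def launch_with_interface(ifname, argv):
    """Start a child process that sees GLOO_SOCKET_IFNAME=ifname.

    Hypothetical helper: the parent's own environment is left untouched,
    so two children can be given two different interfaces.
    """
    env = dict(os.environ, GLOO_SOCKET_IFNAME=ifname)
    return subprocess.Popen(argv, env=env, stdout=subprocess.PIPE, text=True)


# Stand-in for a real training script: the child just reports what it sees.
child_code = "import os; print(os.environ['GLOO_SOCKET_IFNAME'])"

for ifname in ("eno2", "enp94s0f1"):
    proc = launch_with_interface(ifname, [sys.executable, "-c", child_code])
    out, _ = proc.communicate()
    print(ifname, "->", out.strip())
```

In a real run the child command would be the script that calls init_process_group, and each rank would be started with the interface it should use.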

Thank you very much!

@mrshenli I face a similar issue. When I run on one machine or within one cloud platform such as Azure, init_rpc runs fine, i.e. it works when all nodes are on the same subnet. But if I run the server (rank 0) on one cloud platform and rank 1 on a different cloud platform, it throws the exception "RuntimeError: Gloo connectFullMesh failed with Connection reset by peer". I am able to ping the server from the worker and vice versa. I even tried tunnelling both connections through a VPN server, but I get the same error. How do I solve this?
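
A successful ping only shows that ICMP gets through; it says nothing about whether TCP connections on the ports used by the job are allowed. A rough sketch of a TCP reachability check follows. can_connect is a hypothetical helper, and probing MASTER_PORT only covers the rendezvous step: as far as I understand, Gloo opens further connections between ranks on other ports, so a firewall or NAT between the two clouds can still break connectFullMesh even when this check passes.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

On the worker, this could be called as can_connect(os.environ["MASTER_ADDR"], int(os.environ["MASTER_PORT"])) while the rank 0 process is already waiting in init_rpc.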


# Server (rank 0)
import os

import torch
import torch.distributed.rpc as rpc

os.environ['MASTER_ADDR'] = ''
os.environ['MASTER_PORT'] = '3332'
rpc.init_rpc("worker0", rank=0, world_size=2)
# Run torch.add(torch.ones(2), 3) on worker1 and block until the result is back.
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 3))


# Worker (rank 1)
import os

import torch.distributed.rpc as rpc

os.environ['MASTER_ADDR'] = ''
os.environ['MASTER_PORT'] = '3332'
rpc.init_rpc("worker1", rank=1, world_size=2)


[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with […/third_party/gloo/gloo/transport/tcp/] no error
Traceback (most recent call last):
  File "/home/ubuntu/", line 13, in
    rpc.init_rpc("worker1", rank=1, world_size=2)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/", line 200, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/", line 233, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/", line 104, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/", line 324, in _tensorpipe_init_backend_handler
    group = _init_process_group(store, rank, world_size)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/rpc/", line 112, in _init_process_group
    group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Gloo connectFullMesh failed with […/third_party/gloo/gloo/transport/tcp/] no error

I solved this by adding os.environ["TP_SOCKET_IFNAME"] = "tun0" and os.environ["GLOO_SOCKET_IFNAME"] = "tun0" where I call init_rpc. I was also tunnelling the communication through a VPN.
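
For anyone trying the same fix, here is a small sketch of how it might look in the worker script. pin_rpc_interface and interface_exists are hypothetical helper names, "tun0" is the VPN interface from the post above, and the sketch assumes both variables are read when init_rpc sets up its sockets, so they are set before that call. The existence check is only a guard against a mistyped interface name.

```python
import os
import socket


def interface_exists(ifname):
    """Return True if the OS reports a network interface with this name."""
    return any(name == ifname for _, name in socket.if_nameindex())


def pin_rpc_interface(ifname):
    """Point both Gloo and TensorPipe at the same network interface."""
    if not interface_exists(ifname):
        print(f"warning: no interface named {ifname!r} on this host")
    os.environ["GLOO_SOCKET_IFNAME"] = ifname
    os.environ["TP_SOCKET_IFNAME"] = ifname


pin_rpc_interface("tun0")

# The RPC setup then follows as in the scripts above, for example:
# import torch.distributed.rpc as rpc
# rpc.init_rpc("worker1", rank=1, world_size=2)
```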

I’m facing the same issue, but it doesn’t go away even if I set GLOO_SOCKET_IFNAME and TP_SOCKET_IFNAME.

Can you be more specific so that I can help? Knowing what the problem is, how your distributed code is structured, and what error you get would let me help you.