Strange behaviour of GLOO tcp transport

Hi @mrshenli,

Thank you very much for your earlier answers. I recently encountered another problem with the Gloo backend. One of my servers has two network interfaces, eno2 (10.1.3.6) and enp94s0f1 (10.1.3.2), and both of them can reach a remote master node at 10.1.3.1, which I verified using

ping -I 10.1.3.2 10.1.3.1
or
ping -I 10.1.3.6 10.1.3.1

Then, in my PyTorch code, I want to use eno2 for the process group on this slave node, so I ran the following in the terminal:
export GLOO_SOCKET_IFNAME=eno2
before launching the Python code that executes:
dist.init_process_group(
    backend='gloo',
    init_method='tcp://10.1.3.1:12345',
    world_size=2,
    rank=1,
)
However, the slave node actually used enp94s0f1 (10.1.3.2) instead of eno2 as I intended.

If I bring enp94s0f1 down and leave only eno2 up, init_process_group uses eno2.
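
One way to see which local address the kernel picks by default when reaching the master is the small check below. This is only a diagnostic sketch: the helper name local_ip_for is made up for illustration, and it reports the routing decision for an ordinary unbound socket, which may or may not match what Gloo does internally.

```python
import socket


def local_ip_for(remote_ip: str, port: int = 9) -> str:
    """Return the local source address the kernel selects to reach remote_ip.

    Hypothetical helper for illustration; not part of PyTorch or Gloo.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # connect() on a UDP socket sends no packets; it only asks the
        # kernel to resolve a route and choose a source address.
        s.connect((remote_ip, port))
        return s.getsockname()[0]


# Example call on the slave node (address taken from the post):
# local_ip_for("10.1.3.1")
```

If this returns 10.1.3.2 while both interfaces are up, the default route for the 10.1.3.x addresses goes through enp94s0f1, which would be consistent with the behaviour described above.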

Could you help me solve this issue? My ultimate goal is to assign one network interface to one process and a different network interface to another process.
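
A minimal sketch of that per-process setup, assuming GLOO_SOCKET_IFNAME is read when the process group is created, so setting it from Python beforehand is equivalent to the export above. The function names select_gloo_interface and init_worker are hypothetical, and this sketch does not by itself explain or fix the interface selection problem described earlier.

```python
import os


def select_gloo_interface(ifname: str) -> None:
    """Set GLOO_SOCKET_IFNAME for the current process only (hypothetical helper)."""
    os.environ["GLOO_SOCKET_IFNAME"] = ifname


def init_worker(ifname: str, init_method: str, world_size: int, rank: int) -> None:
    """Choose an interface, then create the Gloo process group (hypothetical helper)."""
    # The variable must be set before the process group is created.
    select_gloo_interface(ifname)

    # Imported here so the interface helper can be used without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend="gloo",
        init_method=init_method,
        world_size=world_size,
        rank=rank,
    )


# Example for the slave node in the post:
# init_worker("eno2", "tcp://10.1.3.1:12345", world_size=2, rank=1)
```

Because the environment variable is set inside each process, a second process on the same machine could call init_worker with "enp94s0f1" without affecting the first.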

Thank you very much!