Strange behaviour of the Gloo TCP transport

Hi all.
I have a strange problem: I'm trying to run two tasks on two machines via the following
trivial script:

import torch.distributed as dist

# irank and iwsize come from the command line, e.g. `python train_dist.py 0 2`
dist.init_process_group(backend="gloo", init_method="tcp://192.168.0.1:29500", rank=irank, world_size=iwsize)
arg = None
if dist.get_rank() == 0:
    arg = Dist_Trainer()  # user-defined trainer, constructed only on rank 0
run(dist.get_rank(), dist.get_world_size(), arg)

When I run both processes on one machine, everything works fine.
But when I start the process with rank = 0 on one machine
and the process with rank = 1 on another machine,
the process with rank = 0 fails with the following output:

python train_dist.py 0 2
RANK: 0 wsize: 2
terminate called after throwing an instance of ‘gloo::IoException’ what(): [/opt/conda/conda-bld/pytorch_1544176307774/work/third_party/gloo/gloo/transport/tcp/pair.cc:724] connect [127.0.0.1]:45965: Connection refused

This happens only when I start the process with rank = 1. If I don't start it,
the process with rank = 0 just waits for a connection.

In other words, I assume the TCP connection is established, but then the process with rank = 0
tries to talk to 127.0.0.1?

Update: I tried setting export GLOO_SOCKET_IFNAME=enp2s0,
but the problem remains.

Looks like rank 0 is trying to connect to [127.0.0.1]:45965. Have you unset the MASTER_ADDR and MASTER_PORT environment variables before launching the script?
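For reference, you can check what the script actually inherited right before init (a minimal sketch):

import os

print("MASTER_ADDR =", os.environ.get("MASTER_ADDR"))
print("MASTER_PORT =", os.environ.get("MASTER_PORT"))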

Yes, they were unset.

By the way, if I swap the scripts with rank=0 and rank=1 between these machines,
then the script with rank=1 crashes:

python train_dist.py 1 2
RANK: 1 wsize: 2
terminate called after throwing an instance of ‘gloo::IoException’
what(): [/opt/conda/conda-bld/pytorch_1544176307774/work/third_party/gloo/gloo/transport/tcp/pair.cc:724] connect [127.0.1.1]:3978: Connection refused

The script with rank=0 is still waiting for a connection.

Hi @Oleg_Ivanov,

Have you solved this problem? I am running into the same issue here.

Thanks,
Ziyi

If you just run the following without any other code, does it fail?

import torch.distributed as dist

# on rank 0
dist.init_process_group(
    backend = "gloo",
    init_method = 'tcp://192.168.0.1:29500',
    rank = 0,
    world_size = 2
)

# on rank 1
dist.init_process_group(
    backend = "gloo",
    init_method = 'tcp://192.168.0.1:29500',
    rank = 1,
    world_size = 2
)

Hey @ZiyiZhu, are you trying to run this with RPC? Currently init_rpc does not work together with init_process_group. There are workarounds to create non-default process groups, or we can also add a fix to init_rpc if necessary. This is the tracking issue: https://github.com/pytorch/pytorch/issues/33583

Hi @mrshenli,

This is different from the RPC problem. Back then I was using Google Cloud VMs, and torch.distributed and RPC worked fine there.

However, we recently built new GPU servers in our lab and connected them with a packet switch. They can ping each other using their internal IPs, which are now 10.1.1.101 for rank 0 and 10.1.1.102 for rank 1. So I run the following:

import torch.distributed as dist

# on rank 0
dist.init_process_group(
    backend = "gloo",
    init_method = 'tcp://10.1.1.101:29500',
    rank = 0,
    world_size = 2
)
import torch.distributed as dist

# on rank 1
dist.init_process_group(
    backend = "gloo",
    init_method = 'tcp://10.1.1.101:29500',
    rank = 1,
    world_size = 2
)

However, it failed with

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-532df564c254> in <module>
      6     init_method = 'tcp://10.1.1.101:29500',
      7     rank = 1,
----> 8     world_size = 2
      9 )

~/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in init_process_group(backend, init_method, timeout, world_size, rank, store, group_name)
    401             store,
    402             group_name=group_name,
--> 403             timeout=timeout)
    404 
    405     _pg_group_ranks[_default_pg] = {i: i for i in range(_default_pg.size())}

~/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _new_process_group_helper(world_size, rank, group_ranks, backend, store, group_name, timeout)
    469                 rank,
    470                 world_size,
--> 471                 timeout=timeout)
    472             _pg_map[pg] = (Backend.GLOO, store)
    473             _pg_names[pg] = group_name

RuntimeError: [/opt/conda/conda-bld/pytorch_1587428398394/work/third_party/gloo/gloo/transport/tcp/pair.cc:769] connect [127.0.0.1]:31662: Connection refused

Which I guess is the same problem @Oleg_Ivanov is hitting too. As for

export GLOO_SOCKET_IFNAME=eno2

should I simply run it in any terminal? eno2 is my NIC.

Please let me know if you have any thoughts. Thank you very much for your help!

Yes, either set it in the terminal or pass GLOO_SOCKET_IFNAME=eno2 as a prefix to the command that launches the process.

Another cause might be the hostname-to-IP mapping. IIUC, Gloo tries to resolve the IP from the hostname. What does the following command return for you?

getent hosts `hostname`
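An equivalent check from Python (a minimal sketch using only the standard library; socket.gethostbyname may not follow exactly the same lookup path as getent, but it usually agrees):

import socket

# Roughly what Gloo's default device setup relies on: resolve this machine's
# hostname to an IP address. If this prints 127.0.0.1 or 127.0.1.1, Gloo will
# end up on the loopback interface instead of the NIC.
hostname = socket.gethostname()
print(hostname, socket.gethostbyname(hostname))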

Hi @mrshenli,

Oh, this may be the problem. On the new servers (10.1.1.101 & 10.1.1.102), getent hosts `hostname` returns nothing.

I am also testing torch.distributed on some old servers in my lab right now, and they work. On one of the old servers, the command does return the IPs currently in use.
I am using 10.0.1.101 and 10.0.1.102 for the torch.distributed test on those old servers.

I will figure this out in the new servers and let you know! Thank you!

Best,
Ziyi


Hi @mrshenli,

Problem solved. As on the old server, I added 10.1.1.101 as a host in /etc/hosts and updated /etc/hostname. Now when I run getent hosts `hostname`, it returns 10.1.1.101 host1.

However, that happens to be the IP of one particular NIC port (eno2). What if I want to use another Ethernet address, say 10.1.2.101 on another NIC port (eno3)? Do I need to change /etc/hostname every time?

Thank you,

Looking at the code, this is not the expected behavior. It should always try GLOO_SOCKET_IFNAME first if it is set. Somehow, it didn't pick up the env var.

            char* ifnameEnv = getenv(GLOO_SOCKET_IFNAME_ENV);
            if (ifnameEnv) {
              for (const auto& iface : split(',', ifnameEnv)) {
                options.devices.push_back(
                    ::c10d::ProcessGroupGloo::createDeviceForInterface(iface));
              }
            } else {
              // If no hostname is specified, this function looks up
              // the machine's hostname and returns a device instance
              // associated with the address that the hostname resolves to.
              options.devices.push_back(
                  ::c10d::ProcessGroupGloo::createDefaultDevice());
            }

Can you try reading the GLOO_SOCKET_IFNAME env var from Python immediately before init_process_group and see if it gives the correct result?
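Something along these lines (a minimal sketch, reusing the rank-0 address from earlier in this thread):

import os
import torch.distributed as dist

# Print what this process actually sees right before the Gloo backend is created.
print("GLOO_SOCKET_IFNAME =", os.environ.get("GLOO_SOCKET_IFNAME"))

dist.init_process_group(
    backend="gloo",
    init_method="tcp://10.1.1.101:29500",
    rank=0,
    world_size=2,
)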

Let me check Gloo code.

The Gloo part looks correct to me:

Another way to test is to use a non-existent interface, e.g.

export GLOO_SOCKET_IFNAME=nonexist

and then check whether init_process_group throws the following error for you:

    dist.init_process_group("gloo", rank=rank, world_size=world_size)
  File "/scratch/shenli/pytorch/torch/distributed/distributed_c10d.py", line 425, in init_process_group
    _default_pg = _new_process_group_helper(
  File "/scratch/shenli/pytorch/torch/distributed/distributed_c10d.py", line 499, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:83] ifa != nullptr. Unable to find address for: nonexist

Hi @mrshenli,

After I

export GLOO_SOCKET_IFNAME=nonexist

This is the error I got. Does it seem that it bypasses nonexist and looks at some other interface? If I add the master address, it just hangs there waiting for the second rank to join.

Thanks,

Can you keep/uncomment the init_method line, or set MASTER_ADDR and MASTER_PORT? It seems it failed during Python-land argument checking due to the missing master addr/port, before entering the C++ pybind methods.

Hi @mrshenli,

I guess I found the problem. If I do export GLOO_SOCKET_IFNAME=nonexist in the terminal, it does not become an environment variable in the Jupyter Notebook, but Python launched directly from that terminal does see it.

So I guess I have to go the other way around and set it in the Jupyter Notebook explicitly? Here is the result if I do what you suggested.

import torch.distributed as dist
import os

print(os.environ.get('GLOO_SOCKET_IFNAME'))

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '23456'

os.environ['GLOO_SOCKET_IFNAME'] = 'nonexist'
print(os.environ.get('GLOO_SOCKET_IFNAME'))
# on rank 0
dist.init_process_group(
    backend = "gloo",
    init_method = 'tcp://10.1.1.101:29500',
    rank = 0,
    world_size = 1
)

None
nonexist
----------------------------------------------------------------------
RuntimeError                         Traceback (most recent call last)
<ipython-input-1-ad5d77a63395> in <module>
     14     init_method = 'tcp://10.1.1.101:29500',
     15     rank = 0,
---> 16     world_size = 1
     17 )
     18 

~/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in init_process_group(backend, init_method, timeout, world_size, rank, store, group_name)
    401             store,
    402             group_name=group_name,
--> 403             timeout=timeout)
    404 
    405     _pg_group_ranks[_default_pg] = {i: i for i in range(_default_pg.size())}

~/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _new_process_group_helper(world_size, rank, group_ranks, backend, store, group_name, timeout)
    469                 rank,
    470                 world_size,
--> 471                 timeout=timeout)
    472             _pg_map[pg] = (Backend.GLOO, store)
    473             _pg_names[pg] = group_name

RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1587428398394/work/third_party/gloo/gloo/transport/tcp/device.cc:83] ifa != nullptr. Unable to find address for: nonexist

Thanks,
Ziyi


Ah, I see. Yep, setting it directly in the notebook should work, I think.
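For example, something like this at the top of the notebook (a minimal sketch, assuming eno2 is the interface you want Gloo to use):

import os
import torch.distributed as dist

# Must be set before init_process_group creates the Gloo backend.
os.environ["GLOO_SOCKET_IFNAME"] = "eno2"

dist.init_process_group(
    backend="gloo",
    init_method="tcp://10.1.1.101:29500",
    rank=0,
    world_size=2,
)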


Hi @mrshenli,

To follow up on the issue below, I wonder if you could tell me more about why we currently cannot use "nccl" as a backend for RPC communication? We have to explicitly copy the data to CPU and then do the transmission.

What is the concern that keeps RPC from doing something similar to DDP, where the GPU has direct access to the NIC, for model-parallel algorithms?

Thank you

To follow up on the issue below, I wonder if you could tell me more about why we currently cannot use "nccl" as a backend for RPC communication?

This is because NCCL did not yet support p2p (send/recv) communication when we developed RPC. It is possible to use NCCL broadcast to mimic send/recv, but that is too hackish.

P2p communication is coming to NCCL in v2.7. When that is ready, we can probably add it to ProcessGroupAgent or the new TensorPipeAgent (the latter is a more performant RPC agent implementation and should be able to use the best available channels, e.g., IB/ETH/NVLink). See this PR: https://github.com/pytorch/pytorch/pull/35483

We have to explicitly copy the data to CPU and then do the transmission.

For the Gloo backend, even if the application doesn't copy the tensor from CUDA to CPU, Gloo would need to do that internally anyway. Hence, this explicit copy in the application is not a performance limitation when using the Gloo backend.

We used to do that GPU-to-CPU copy implicitly in v1.4, but later realized that applications could run into unexpected errors if the destination device is not available on the callee. E.g., if I do rpc.rpc_sync(...., args=(torch.zeros(2).to(3),)) and cuda:3 is not available on the callee, it would throw an error. So we decided to make the copy explicit in applications.
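So today the pattern looks roughly like this (a hedged sketch: consume and the worker name "worker1" are placeholders, and it assumes rpc.init_rpc has already been called on both ends):

import torch
import torch.distributed.rpc as rpc

def consume(t):
    # Runs on the callee; it can move the CPU tensor to whichever device it has.
    return t.sum()

gpu_tensor = torch.zeros(2, device="cuda:0")
# Copy to CPU explicitly before sending over RPC.
ret = rpc.rpc_sync("worker1", consume, args=(gpu_tensor.cpu(),))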

What is the concern that keeps RPC from doing something similar to DDP, where the GPU has direct access to the NIC, for model-parallel algorithms?

At the API level, the difference is that DDP is supposed to run on a set of homogeneous servers, while RPC should be able to support heterogeneous clusters, so device mismatches can be common in RPC. We are adding explicit device placement support (something similar to map_location in torch.save and torch.load) to the RPC API. This is an early issue to track. @osalpekar is working on a design RFC for that. Looking forward to hearing your comments when that RFC is posted. 🙂


Thank you very much for your detailed explanations! I agree that being explicit can avoid lots of unexpected errors, and I really look forward to seeing the RFC design.

Best,
Ziyi