Simple multi-node NCCL script hangs at barrier

adrianwaelchli · May 10, 2021, 8:26pm

Hi there

I’m trying to run the following simple script on two machines:

import torch.distributed
import time
import argparse

def main():

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int)
    parser.add_argument("--global_rank", type=int)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)

    print("init")
    torch.distributed.init_process_group(
        backend="nccl",
        init_method="tcp://10.10.10.22:1191",
        world_size=2,
        rank=args.global_rank,
    )
    time.sleep(5)
    print("barrier")
    torch.distributed.barrier()  # HANGS HERE


if __name__ == "__main__":
    main()

I have two machines with the IPs 10.10.10.22 (master) and 10.10.10.25 with port 1191 open on both.

On master:

export NCCL_SOCKET_IFNAME=eno1
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
python test.py --local_rank 0 --global_rank 0

lambda-server4:1793990:1793990 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1
lambda-server4:1793990:1793990 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eno1
lambda-server4:1793990:1793990 [0] NCCL INFO Bootstrap : Using [0]eno1:10.10.10.22<0>
lambda-server4:1793990:1793990 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
lambda-server4:1793990:1793990 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
lambda-server4:1793990:1793990 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1
lambda-server4:1793990:1793990 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eno1
lambda-server4:1793990:1793990 [0] NCCL INFO NET/Socket : Using [0]eno1:10.10.10.22<0>
lambda-server4:1793990:1793990 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1

On the other machine:

export NCCL_SOCKET_IFNAME=enp49s0f1
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
python test.py --local_rank 0 --global_rank 1

hyperplane1:1255526:1255526 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp49s0f1
hyperplane1:1255526:1255526 [1] NCCL INFO NCCL_SOCKET_IFNAME set to enp49s0f1
hyperplane1:1255526:1255526 [1] NCCL INFO Bootstrap : Using [0]enp49s0f1:10.10.10.25<0>
hyperplane1:1255526:1255526 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hyperplane1:1255526:1255526 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
hyperplane1:1255526:1255526 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp49s0f1
hyperplane1:1255526:1255526 [1] NCCL INFO NCCL_SOCKET_IFNAME set to enp49s0f1
hyperplane1:1255526:1255526 [1] NCCL INFO NET/Socket : Using [0]enp49s0f1:10.10.10.25<0>
hyperplane1:1255526:1255526 [1] NCCL INFO Using network Socket

hyperplane1:1266304:1266392 [0] NCCL INFO Call to connect returned Connection timed out, retrying
hyperplane1:1266304:1266392 [0] NCCL INFO Call to connect returned Connection timed out, retrying

hyperplane1:1266304:1266392 [0] include/socket.h:403 NCCL WARN Connect to 10.10.10.22<49177> failed : Connection timed out
hyperplane1:1266304:1266392 [0] NCCL INFO bootstrap.cc:95 -> 2
hyperplane1:1266304:1266392 [0] NCCL INFO bootstrap.cc:309 -> 2
hyperplane1:1266304:1266392 [0] NCCL INFO init.cc:555 -> 2
hyperplane1:1266304:1266392 [0] NCCL INFO init.cc:840 -> 2
hyperplane1:1266304:1266392 [0] NCCL INFO group.cc:73 -> 2 [Async thread]

The program hangs at the barrier call and I don’t know why. I get past the init_process_group call on both machines so I assume the connection between the two servers is fine, but at the barrier it times out.

Does anyone see the problem here? I have probably missed a configuration step, but I don’t know what.

PyTorch 1.8.1
NCCL version 2.7.8+cuda11.1

rvarm1 · May 10, 2021, 11:29pm

Hi, what is the output of ifconfig on both of your machines? If you’re using the ifconfig output to set NCCL_SOCKET_IFNAME on each node, you could try setting NCCL_SOCKET_IFNAME=eno1,enp49s0f1 (comma-separated, no space) as suggested in the comments on this issue: How to set NCCL_SOCKET_IFNAME · Issue #286 · NVIDIA/nccl · GitHub. In general, make sure that the two interfaces can talk to each other in your network setup.

In addition, can you try replacing barrier() with an all_reduce on a tensor (allocated on the appropriate GPU) and check whether that works as expected?

adrianwaelchli · May 11, 2021, 12:13am

Hi, thanks for your input.

ifconfig lists multiple interfaces. On each machine I picked the one that carries the IP address I use to log in (10.10.10.22 and 10.10.10.25).
First machine:

br-5443622090a7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.18.0.1  netmask 255.255.0.0  broadcast 172.18.255.255
        inet6 fe80::42:1aff:fec8:448c  prefixlen 64  scopeid 0x20<link>
        ether 02:42:1a:c8:44:8c  txqueuelen 0  (Ethernet)
        RX packets 178271  bytes 4991588 (4.9 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 39  bytes 5694 (5.6 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:e5ff:fe0a:b382  prefixlen 64  scopeid 0x20<link>
        ether 02:42:e5:0a:b3:82  txqueuelen 0  (Ethernet)
        RX packets 76931  bytes 4131399 (4.1 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 86761  bytes 625271597 (625.2 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.10.10.22  netmask 255.255.255.0  broadcast 10.10.10.255
        inet6 fe80::3eec:efff:fe03:ed46  prefixlen 64  scopeid 0x20<link>
        ether 3c:ec:ef:03:ed:46  txqueuelen 1000  (Ethernet)
        RX packets 206957342  bytes 291080003217 (291.0 GB)
        RX errors 0  dropped 3340746  overruns 0  frame 0
        TX packets 47119321  bytes 7465744617 (7.4 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0xc1320000-c133ffff

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 31035316  bytes 3161891159 (3.1 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 31035316  bytes 3161891159 (3.1 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth212871c: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::e400:bff:feb2:acb0  prefixlen 64  scopeid 0x20<link>
        ether e6:00:0b:b2:ac:b0  txqueuelen 0  (Ethernet)
        RX packets 339096  bytes 21260368 (21.2 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 359220  bytes 21820500 (21.8 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth406ee85: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::d827:24ff:fe8c:3c0f  prefixlen 64  scopeid 0x20<link>
        ether da:27:24:8c:3c:0f  txqueuelen 0  (Ethernet)
        RX packets 1807  bytes 131066 (131.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2265  bytes 13311970 (13.3 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vethb0ca500: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::384f:23ff:fe37:a7b9  prefixlen 64  scopeid 0x20<link>
        ether 3a:4f:23:37:a7:b9  txqueuelen 0  (Ethernet)
        RX packets 338768  bytes 21241184 (21.2 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 359348  bytes 21826112 (21.8 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Second machine:

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:e5ff:fe55:c1a2  prefixlen 64  scopeid 0x20<link>
        ether 02:42:e5:55:c1:a2  txqueuelen 0  (Ethernet)
        RX packets 990861  bytes 50900887 (50.9 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1872198  bytes 4152860383 (4.1 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp49s0f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.10.10.25  netmask 255.255.255.0  broadcast 10.10.10.255
        inet6 fe80::3eec:efff:fe1e:dd5b  prefixlen 64  scopeid 0x20<link>
        ether 3c:ec:ef:1e:dd:5b  txqueuelen 1000  (Ethernet)
        RX packets 82711207  bytes 110825052923 (110.8 GB)
        RX errors 0  dropped 489968  overruns 0  frame 0
        TX packets 25754173  bytes 2401860983 (2.4 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 36992814  bytes 2445844459 (2.4 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 36992814  bytes 2445844459 (2.4 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

I tried setting NCCL_SOCKET_IFNAME=eno1,enp49s0f1 on both servers, but unfortunately it didn’t help.

I replaced the barrier with an allreduce like so:

    x = torch.tensor([args.global_rank], dtype=torch.float, device=torch.device("cuda", 0))
    torch.distributed.all_reduce(x)
    print(x)

but it hangs the same way as with barrier.

The two network interfaces can talk to each other: I verified that I can listen on one machine and connect to it with telnet from the other machine:

# on one server
nc -l 1191
# on the other server
telnet 10.10.10.22 1191

and this works both ways.
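
Note that the nc/telnet check above only exercises port 1191, the rendezvous port from init_method, whereas the NCCL WARN line in the first post shows a failed connect to 10.10.10.22 on port 49177. A minimal sketch of a reusable TCP check follows. The helper name can_connect is hypothetical, and something (for example nc -l <port>) has to be listening on the target port for the result to mean anything.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

For example, with nc -l 49177 running on the master, calling can_connect("10.10.10.22", 49177) on the other machine would show whether that particular port is reachable. NCCL picks its ports at run time, so one port is only a sample and not proof that the whole range is open.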

aykamko · February 16, 2023, 12:41am

I ran into this same issue and found a quick solution in this thread on GitHub: Question about nccl p2p disable · Issue #631 · NVIDIA/nccl · GitHub

TL;DR: Run your script with NCCL_P2P_DISABLE=1 set as an environment variable. However, this may slow down your program, since communication will go through shared memory instead of directly from GPU to GPU.

Longer explanation: I have IO Virtualization enabled in the BIOS on this machine. This is also known as PCI Access Control Services, ACS, VT-d, or IOMMU. (Lots of names for the same thing???) This NVIDIA documentation gives more info: Troubleshooting — NCCL 2.16.2 documentation

The other way to fix this is to disable IO Virtualization, which may or may not be desired.
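
The workaround can also be applied from inside the script rather than in the shell. This is a sketch only: the helper name disable_nccl_p2p is hypothetical, and it assumes the variables are set before init_process_group is called so that NCCL sees them when it initializes.

```python
import os


def disable_nccl_p2p(env=os.environ):
    """Set the NCCL variables discussed above on the given environment mapping."""
    # Turn off direct GPU-to-GPU (P2P) transport, as suggested in the linked issue
    env["NCCL_P2P_DISABLE"] = "1"
    # Keep NCCL logging on so the chosen transport shows up in the output
    env["NCCL_DEBUG"] = "INFO"
    return env
```

Calling disable_nccl_p2p() at the top of main(), before any torch.distributed call, has the same effect as exporting the variables in the shell for that process.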

adrianwaelchli · February 16, 2023, 2:45am

Thanks for the suggestions @aykamko. It looks like a hang like this one can have several causes. It’s been a long time since I posted this, but if I remember correctly, I had to open and allow some ports for NCCL to communicate, and the blocked ports were the reason for the hang. I think this troubleshooting section eventually solved it for me: Troubleshooting — NCCL 2.16.2 documentation
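
To see which ports would need to be allowed, one option is to read the kernel's ephemeral port range, since the NCCL troubleshooting page describes restricting NCCL's ports through the Linux setting net.ipv4.ip_local_port_range. The sketch below assumes a Linux host. The function names are hypothetical, and whether this was the exact setting used in this thread is not confirmed.

```python
from pathlib import Path
from typing import Tuple


def parse_port_range(text: str) -> Tuple[int, int]:
    """Parse the 'low high' pair stored in ip_local_port_range."""
    low, high = text.split()
    return int(low), int(high)


def local_port_range(
    path: str = "/proc/sys/net/ipv4/ip_local_port_range",
) -> Tuple[int, int]:
    """Read the ephemeral port range from procfs (Linux only)."""
    return parse_port_range(Path(path).read_text())
```

On a Linux machine, local_port_range() returns the lower and upper bound of that range. The firewall between the two servers would need to allow TCP traffic on those ports, in addition to the rendezvous port 1191.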