`torch.distributed.init_process_group` hangs with 4 GPUs with `backend="NCCL"` but not `"gloo"`

Problem
Running a torch.distributed job across 4 NVIDIA A100 80GB GPUs with the NCCL backend hangs. The same code runs fine with the gloo backend.

nvidia-smi info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:01:00.0 Off |                    0 |
| N/A   41C    P0    55W / 300W |      2MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:25:00.0 Off |                    0 |
| N/A   40C    P0    57W / 300W |      2MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  Off  | 00000000:41:00.0 Off |                    0 |
| N/A   42C    P0    54W / 300W |      2MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   40C    P0    56W / 300W |      2MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Code to reproduce:

import argparse
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from datetime import timedelta


DEFAULT_TIMEOUT = timedelta(seconds=10)


def func(args):
    # Minimal payload: each worker just prints a random tensor.
    print(torch.randn(1))

def run(main_func, backend, num_machines, num_gpus, machine_rank, dist_url, args=()):
    # Spawn one worker per local GPU; mp.spawn passes the local rank as the
    # first argument to distributed_worker.
    world_size = num_machines * num_gpus

    mp.spawn(
        distributed_worker,
        nprocs=num_gpus,
        args=(
            main_func,
            backend,
            world_size,
            num_gpus,
            machine_rank,
            dist_url,
            args,
        ),
        daemon=False,
    )

def distributed_worker(
    local_rank,
    main_func,
    backend,
    world_size,
    num_gpus_per_machine,
    machine_rank,
    dist_url,
    args,
    timeout=DEFAULT_TIMEOUT,
):
    LOCAL_PROCESS_GROUP = None

    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available. Please check your installation.")

    # Global rank = machine offset + local GPU index on this machine.
    global_rank = machine_rank * num_gpus_per_machine + local_rank
    try:
        dist.init_process_group(
            backend=backend,
            init_method=dist_url,
            world_size=world_size,
            rank=global_rank,
            timeout=timeout,
        )
    except Exception:
        print(f"Process group URL: {dist_url}")
        raise


    dist.barrier()

    print(f"Global rank {global_rank}.")
    print("Synchronized GPUs.")

    if num_gpus_per_machine > torch.cuda.device_count():
        raise RuntimeError("Requested more GPUs per machine than are available.")
    torch.cuda.set_device(local_rank)

    # Setup the local process group (which contains ranks within the same machine)
    if LOCAL_PROCESS_GROUP is not None:
        raise RuntimeError

    num_machines = world_size // num_gpus_per_machine

    for idx in range(num_machines):
        ranks_on_i = list(range(idx * num_gpus_per_machine, (idx + 1) * num_gpus_per_machine))
        pg = dist.new_group(ranks_on_i)
        if idx == machine_rank:
            LOCAL_PROCESS_GROUP = pg

    main_func(args)


def main():
    torch.set_num_threads(1)
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["NCCL_DEBUG"] = "INFO"

    print(f"CUDA {torch.version.cuda} - cuDNN {torch.backends.cudnn.version()} - cudaNCCL {torch.cuda.nccl.version()}")
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", type=str, default="NCCL", help="'gloo' or 'NCCL'.")
    parser.add_argument("--num-gpus", type=int, default=1, help="# GPUs per machine.")
    parser.add_argument("--num-machines", type=int, default=1, help="# of machines.")
    parser.add_argument(
        "--machine-rank",
        type=int,
        default=0,
        help="the rank of this machine (unique per machine).",
    )

    port = 1234
    parser.add_argument(
        "--dist-url",
        type=str,
        default=f"tcp://127.0.0.1:{port}",
        help="initialization URL for pytorch distributed backend. See "
        "https://pytorch.org/docs/stable/distributed.html for details.",
    )

    args = parser.parse_args()

    run(
        main_func=func,
        backend=args.backend,
        num_machines=args.num_machines,
        num_gpus=args.num_gpus,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=()
    )


if __name__ == '__main__':
    main()
  • Running python3 test.py --num-gpus 2 --backend NCCL or python3 test.py --num-gpus 4 --backend gloo causes no problem.

  • Running

    python3 test.py --num-gpus 4 --backend NCCL

    the script hangs/freezes and I have to kill it.
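
Side note on the 10-second timeout in the repro: per the torch.distributed docs for this PyTorch version, the timeout passed to init_process_group is only enforced for NCCL collectives when blocking wait or async error handling is enabled, so the hang never turns into an exception. A minimal sketch of what could be set in main() before spawning the workers:

import os

# Assumption: these must be exported before the NCCL process group is created.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # abort and raise after `timeout` instead of hanging
# alternatively:
# os.environ["NCCL_BLOCKING_WAIT"] = "1"       # make every wait() block and raise on timeout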

Output

CUDA 11.3 - cuDNN 8200 - cudaNCCL (2, 10, 3)
eudoxus:4055975:4055975 [0] NCCL INFO Bootstrap : Using enp195s0f0:192.168.200.110<0>
eudoxus:4055975:4055975 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
eudoxus:4055975:4055975 [0] NCCL INFO NET/IB : No device found.
eudoxus:4055975:4055975 [0] NCCL INFO NET/Socket : Using [0]enp195s0f0:192.168.200.110<0> [1]eno1:192.168.200.104<0> [2]virbr0:192.168.122.1<0> [3]veth0b22d9b:fe80::7049:e9ff:fe1d:1e92%veth0b22d9b<0>
eudoxus:4055975:4055975 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
eudoxus:4055977:4055977 [2] NCCL INFO Bootstrap : Using enp195s0f0:192.168.200.110<0>
eudoxus:4055976:4055976 [1] NCCL INFO Bootstrap : Using enp195s0f0:192.168.200.110<0>
eudoxus:4055977:4055977 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
eudoxus:4055976:4055976 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
eudoxus:4055976:4055976 [1] NCCL INFO NET/IB : No device found.
eudoxus:4055977:4055977 [2] NCCL INFO NET/IB : No device found.
eudoxus:4055977:4055977 [2] NCCL INFO NET/Socket : Using [0]enp195s0f0:192.168.200.110<0> [1]eno1:192.168.200.104<0> [2]virbr0:192.168.122.1<0> [3]veth0b22d9b:fe80::7049:e9ff:fe1d:1e92%veth0b22d9b<0>
eudoxus:4055977:4055977 [2] NCCL INFO Using network Socket
eudoxus:4055976:4055976 [1] NCCL INFO NET/Socket : Using [0]enp195s0f0:192.168.200.110<0> [1]eno1:192.168.200.104<0> [2]virbr0:192.168.122.1<0> [3]veth0b22d9b:fe80::7049:e9ff:fe1d:1e92%veth0b22d9b<0>
eudoxus:4055976:4055976 [1] NCCL INFO Using network Socket
eudoxus:4055978:4055978 [3] NCCL INFO Bootstrap : Using enp195s0f0:192.168.200.110<0>
eudoxus:4055978:4055978 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
eudoxus:4055978:4055978 [3] NCCL INFO NET/IB : No device found.
eudoxus:4055978:4055978 [3] NCCL INFO NET/Socket : Using [0]enp195s0f0:192.168.200.110<0> [1]eno1:192.168.200.104<0> [2]virbr0:192.168.122.1<0> [3]veth0b22d9b:fe80::7049:e9ff:fe1d:1e92%veth0b22d9b<0>
eudoxus:4055978:4055978 [3] NCCL INFO Using network Socket
eudoxus:4055975:4056038 [0] NCCL INFO Channel 00/04 :    0   1   2   3
eudoxus:4055976:4056040 [1] NCCL INFO Trees [0] 3/-1/-1->1->0 [1] 0/-1/-1->1->3 [2] 3/-1/-1->1->0 [3] 0/-1/-1->1->3
eudoxus:4055977:4056039 [2] NCCL INFO Trees [0] 0/-1/-1->2->-1 [1] -1/-1/-1->2->0 [2] 0/-1/-1->2->-1 [3] -1/-1/-1->2->0
eudoxus:4055975:4056038 [0] NCCL INFO Channel 01/04 :    0   1   2   3
eudoxus:4055975:4056038 [0] NCCL INFO Channel 02/04 :    0   1   2   3
eudoxus:4055976:4056040 [1] NCCL INFO Setting affinity for GPU 1 to ffff00,00000000,00ffff00
eudoxus:4055978:4056041 [3] NCCL INFO Trees [0] -1/-1/-1->3->1 [1] 1/-1/-1->3->-1 [2] -1/-1/-1->3->1 [3] 1/-1/-1->3->-1
eudoxus:4055977:4056039 [2] NCCL INFO Setting affinity for GPU 2 to ffff00,00000000,00ffff00
eudoxus:4055975:4056038 [0] NCCL INFO Channel 03/04 :    0   1   2   3
eudoxus:4055975:4056038 [0] NCCL INFO Trees [0] 1/-1/-1->0->2 [1] 2/-1/-1->0->1 [2] 1/-1/-1->0->2 [3] 2/-1/-1->0->1
eudoxus:4055978:4056041 [3] NCCL INFO Setting affinity for GPU 3 to ffff00,00000000,00ffff00,00000000
eudoxus:4055975:4056038 [0] NCCL INFO Setting affinity for GPU 0 to ffff00,00000000,00ffff00
eudoxus:4055976:4056040 [1] NCCL INFO Channel 00 : 1[25000] -> 2[41000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 00 : 2[41000] -> 3[c1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 00 : 0[1000] -> 1[25000] via P2P/IPC/read
eudoxus:4055978:4056041 [3] NCCL INFO Channel 00 : 3[c1000] -> 0[1000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 01 : 1[25000] -> 2[41000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 01 : 2[41000] -> 3[c1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 01 : 0[1000] -> 1[25000] via P2P/IPC/read
eudoxus:4055978:4056041 [3] NCCL INFO Channel 01 : 3[c1000] -> 0[1000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 02 : 1[25000] -> 2[41000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 02 : 2[41000] -> 3[c1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 02 : 0[1000] -> 1[25000] via P2P/IPC/read
eudoxus:4055978:4056041 [3] NCCL INFO Channel 02 : 3[c1000] -> 0[1000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 03 : 1[25000] -> 2[41000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 03 : 2[41000] -> 3[c1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 03 : 0[1000] -> 1[25000] via P2P/IPC/read
eudoxus:4055978:4056041 [3] NCCL INFO Channel 03 : 3[c1000] -> 0[1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Connected all rings
eudoxus:4055976:4056040 [1] NCCL INFO Connected all rings
eudoxus:4055977:4056039 [2] NCCL INFO Connected all rings
eudoxus:4055978:4056041 [3] NCCL INFO Connected all rings
eudoxus:4055975:4056038 [0] NCCL INFO Channel 00 : 0[1000] -> 2[41000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 00 : 1[25000] -> 3[c1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 01 : 0[1000] -> 2[41000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 01 : 1[25000] -> 3[c1000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 00 : 2[41000] -> 0[1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 02 : 0[1000] -> 2[41000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 02 : 1[25000] -> 3[c1000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 01 : 2[41000] -> 0[1000] via P2P/IPC
eudoxus:4055978:4056041 [3] NCCL INFO Channel 00 : 3[c1000] -> 1[25000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 03 : 0[1000] -> 2[41000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 03 : 1[25000] -> 3[c1000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 02 : 2[41000] -> 0[1000] via P2P/IPC
eudoxus:4055978:4056041 [3] NCCL INFO Channel 01 : 3[c1000] -> 1[25000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 03 : 2[41000] -> 0[1000] via P2P/IPC
eudoxus:4055978:4056041 [3] NCCL INFO Channel 02 : 3[c1000] -> 1[25000] via P2P/IPC
eudoxus:4055978:4056041 [3] NCCL INFO Channel 03 : 3[c1000] -> 1[25000] via P2P/IPC
eudoxus:4055978:4056041 [3] NCCL INFO Connected all trees
eudoxus:4055978:4056041 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
eudoxus:4055978:4056041 [3] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
eudoxus:4055977:4056039 [2] NCCL INFO Connected all trees
eudoxus:4055977:4056039 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
eudoxus:4055977:4056039 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
eudoxus:4055976:4056040 [1] NCCL INFO Channel 00 : 1[25000] -> 0[1000] via P2P/IPC/read
eudoxus:4055976:4056040 [1] NCCL INFO Channel 01 : 1[25000] -> 0[1000] via P2P/IPC/read
eudoxus:4055976:4056040 [1] NCCL INFO Channel 02 : 1[25000] -> 0[1000] via P2P/IPC/read
eudoxus:4055976:4056040 [1] NCCL INFO Channel 03 : 1[25000] -> 0[1000] via P2P/IPC/read
eudoxus:4055975:4056038 [0] NCCL INFO Connected all trees
eudoxus:4055975:4056038 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
eudoxus:4055975:4056038 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
eudoxus:4055976:4056040 [1] NCCL INFO Connected all trees
eudoxus:4055976:4056040 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
eudoxus:4055976:4056040 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
eudoxus:4055977:4056039 [2] NCCL INFO comm 0x7fe4f8002fb0 rank 2 nranks 4 cudaDev 2 busId 41000 - Init COMPLETE
eudoxus:4055975:4056038 [0] NCCL INFO comm 0x7f4250002fb0 rank 0 nranks 4 cudaDev 0 busId 1000 - Init COMPLETE
eudoxus:4055978:4056041 [3] NCCL INFO comm 0x7f56a4002fb0 rank 3 nranks 4 cudaDev 3 busId c1000 - Init COMPLETE
eudoxus:4055976:4056040 [1] NCCL INFO comm 0x7f4400002fb0 rank 1 nranks 4 cudaDev 1 busId 25000 - Init COMPLETE
eudoxus:4055975:4055975 [0] NCCL INFO Launch mode Parallel

Output of `nvidia-smi` after the run:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:01:00.0 Off |                    0 |
| N/A   40C    P0    84W / 300W |   2225MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:25:00.0 Off |                    0 |
| N/A   38C    P0    85W / 300W |   2191MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  Off  | 00000000:41:00.0 Off |                    0 |
| N/A   40C    P0    80W / 300W |   2185MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   39C    P0    84W / 300W |   2185MiB / 80994MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+


+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1923796      C   ...3/envs/direct/bin/python3     2223MiB |
|    1   N/A  N/A   1923797      C   ...3/envs/direct/bin/python3     2189MiB |
|    2   N/A  N/A   1923798      C   ...3/envs/direct/bin/python3     2183MiB |
|    3   N/A  N/A   1923799      C   ...3/envs/direct/bin/python3     2183MiB |
+-----------------------------------------------------------------------------+

I then need to kill all processes manually.
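
To avoid hunting down hung workers by hand, one option (a sketch on top of the repro above, using mp.spawn's non-blocking mode) is to join with a deadline and terminate whatever is still alive:

# inside run(), instead of the blocking mp.spawn call:
ctx = mp.spawn(
    distributed_worker,
    nprocs=num_gpus,
    args=(main_func, backend, world_size, num_gpus, machine_rank, dist_url, args),
    daemon=False,
    join=False,
)
if not ctx.join(timeout=120):   # returns False if some workers are still running
    for p in ctx.processes:
        p.terminate()           # kill hung workers instead of doing it manually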

Versions

Collecting environment information...
PyTorch version: 1.11.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.9.12 (main, Apr  5 2022, 06:56:58)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-107-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
GPU 2: NVIDIA A100 80GB PCIe
GPU 3: NVIDIA A100 80GB PCIe

Nvidia driver version: 470.103.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] numpydoc==1.2.1
[pip3] torch==1.11.0+cu113
[pip3] torchaudio==0.11.0+cu113
[pip3] torchvision==0.12.0+cu113
[conda] numpy                     1.22.3                   pypi_0    pypi
[conda] numpydoc                  1.2.1                    pypi_0    pypi
[conda] torch                     1.11.0+cu113             pypi_0    pypi
[conda] torchaudio                0.11.0+cu113             pypi_0    pypi
[conda] torchvision               0.12.0+cu113             pypi_0    pypi

Was NCCL working on this node before?
If not, could you check whether disabling the IOMMU helps, as described here, since it can also cause a hang?
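
For reference, a quick way to check whether the IOMMU is currently active on the node (a sketch; /sys/kernel/iommu_groups is the standard sysfs location on Linux):

import os

path = "/sys/kernel/iommu_groups"
# One directory per IOMMU group when the IOMMU is enabled; empty or absent when it is disabled.
groups = os.listdir(path) if os.path.isdir(path) else []
print(f"{len(groups)} IOMMU groups (0 usually means the IOMMU is disabled)")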


@ptrblck Thank you! It turns out that the suggested solution worked!


@ptrblck NCCL is working when I set --nodes 1, but when I change this to --nodes 2 the training gets stuck at

    torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                         world_size=args.world_size, rank=args.rank)

Do you have any idea why this may happen? Thanks for your help!

P.S. I have two nodes with 8 GPUs each, and the world size is correct (= 16).
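
One quick sanity check for a two-node hang like this is to verify, from the second node, that the rendezvous port on the master is reachable while rank 0 sits in init_process_group (a sketch; MASTER_ADDR and the port are placeholders for whatever dist_url points at):

import socket

# Placeholders: replace with the address/port from dist_url (e.g. tcp://MASTER_ADDR:29500).
with socket.create_connection(("MASTER_ADDR", 29500), timeout=10) as s:
    print("rendezvous port reachable from", s.getsockname())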

I am running into similar issues using Optuna and DDP on a single node.

When I run the model with fixed hyperparameters it works with multiple GPUs.

When I try to run with Optuna, it works with a single GPU but hangs with multiple GPUs on the NCCL backend, at work.wait() inside broadcast() in distributed_c10d.py.

When I use the gloo backend, it works with a single GPU, but with multiple GPUs it fails with:

Trial 1 failed with parameters: {} because of the following error: RuntimeError('[/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1670034520463/work/third_party/gloo/gloo/transport/tcp/pair.cc:589] Read error [127.0.0.1]:43754: Connection reset by peer').

Any advice?

If you do not have root privileges, you can instead disable the peer-to-peer transport in the NCCL backend by modifying your command (as suggested here):

NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python file.py
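
If editing the launch command is awkward (e.g. when a wrapper script builds it), the same variables can also be set from inside the script, provided this happens before any CUDA call and before the process group is created (a sketch; variable names as in the command above):

import os

# Must run before torch.cuda is initialized and before init_process_group.
os.environ["NCCL_P2P_DISABLE"] = "1"            # disable NVLink/PCIe peer-to-peer transport
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # only effective if CUDA is not yet initialized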