Problem
Running a torch.distributed job on 4 NVIDIA A100 80GB GPUs with the NCCL
backend hangs. The same code runs fine with the gloo backend.
Output of nvidia-smi before the run:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... Off | 00000000:01:00.0 Off | 0 |
| N/A 41C P0 55W / 300W | 2MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... Off | 00000000:25:00.0 Off | 0 |
| N/A 40C P0 57W / 300W | 2MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80G... Off | 00000000:41:00.0 Off | 0 |
| N/A 42C P0 54W / 300W | 2MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80G... Off | 00000000:C1:00.0 Off | 0 |
| N/A 40C P0 56W / 300W | 2MiB / 80994MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Code to reproduce:
import argparse
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from datetime import timedelta

DEFAULT_TIMEOUT = timedelta(seconds=10)


def func(args):
    print(torch.randn(1))


def run(main_func, backend, num_machines, num_gpus, machine_rank, dist_url, args=()):
    world_size = num_machines * num_gpus
    mp.spawn(
        distributed_worker,
        nprocs=num_gpus,
        args=(
            main_func,
            backend,
            world_size,
            num_gpus,
            machine_rank,
            dist_url,
            args,
        ),
        daemon=False,
    )


def distributed_worker(
    local_rank,
    main_func,
    backend,
    world_size,
    num_gpus_per_machine,
    machine_rank,
    dist_url,
    args,
    timeout=DEFAULT_TIMEOUT,
):
    LOCAL_PROCESS_GROUP = None
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available. Please check your installation.")
    global_rank = machine_rank * num_gpus_per_machine + local_rank
    try:
        dist.init_process_group(
            backend=backend,
            init_method=dist_url,
            world_size=world_size,
            rank=global_rank,
            timeout=timeout,
        )
    except Exception as e:
        print(f"Process group URL: {dist_url}")
        raise e
    dist.barrier()
    print(f"Global rank {global_rank}.")
    print("Synchronized GPUs.")
    if num_gpus_per_machine > torch.cuda.device_count():
        raise RuntimeError
    torch.cuda.set_device(local_rank)
    # Setup the local process group (which contains ranks within the same machine)
    if LOCAL_PROCESS_GROUP is not None:
        raise RuntimeError
    num_machines = world_size // num_gpus_per_machine
    for idx in range(num_machines):
        ranks_on_i = list(range(idx * num_gpus_per_machine, (idx + 1) * num_gpus_per_machine))
        pg = dist.new_group(ranks_on_i)
        if idx == machine_rank:
            LOCAL_PROCESS_GROUP = pg
    main_func(args)


def main():
    torch.set_num_threads(1)
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["NCCL_DEBUG"] = "INFO"
    print(f"CUDA {torch.version.cuda} - cuDNN {torch.backends.cudnn.version()} - cudaNCCL {torch.cuda.nccl.version()}")
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", type=str, default="NCCL", help="'gloo' or 'NCCL'.")
    parser.add_argument("--num-gpus", type=int, default=1, help="# GPUs per machine.")
    parser.add_argument("--num-machines", type=int, default=1, help="# of machines.")
    parser.add_argument(
        "--machine-rank",
        type=int,
        default=0,
        help="the rank of this machine (unique per machine).",
    )
    port = 1234
    parser.add_argument(
        "--dist-url",
        type=str,
        default=f"tcp://127.0.0.1:{port}",
        help="initialization URL for pytorch distributed backend. See "
        "https://pytorch.org/docs/stable/distributed.html for details.",
    )
    args = parser.parse_args()
    run(
        main_func=func,
        backend=args.backend,
        num_machines=args.num_machines,
        num_gpus=args.num_gpus,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(),
    )


if __name__ == '__main__':
    main()
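For comparison, here is a stripped-down sanity check (a sketch, assuming the same single-machine setup and that port 29500 is free) that exercises only process-group creation plus a single barrier, without the custom local-group bookkeeping. If this also hangs with 4 GPUs, the problem is in the NCCL transport rather than in the script above:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # env:// initialization; MASTER_ADDR/MASTER_PORT values are assumptions,
    # not taken from the report above.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    torch.cuda.set_device(rank)  # pin each worker to its own GPU before any collective
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    dist.barrier()  # if the issue reproduces, the hang should show up here
    print(f"rank {rank} passed the barrier")
    dist.destroy_process_group()

if __name__ == "__main__":
    n = torch.cuda.device_count()
    mp.spawn(worker, args=(n,), nprocs=n)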
- Running python3 test.py --num-gpus 2 --backend NCCL or python3 test.py --num-gpus 4 --backend gloo causes no problem.
- Running python3 test.py --num-gpus 4 --backend NCCL hangs/freezes and I have to kill it.
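Since the NCCL log below shows every channel connected via P2P/IPC, one way to narrow this down (a diagnostic sketch, not a confirmed fix) would be to rerun with NCCL's peer-to-peer transport disabled. If the 4-GPU run then completes, the hang likely sits in GPU peer-to-peer communication (for example IOMMU/ACS settings on PCIe systems) rather than in PyTorch itself:

# Diagnostic only: NCCL_P2P_DISABLE=1 forces NCCL off the P2P/IPC transport
# and onto shared memory / sockets. Set it before init_process_group,
# or export it in the shell when launching test.py.
os.environ["NCCL_P2P_DISABLE"] = "1"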
Output of the hanging NCCL run with 4 GPUs:
CUDA 11.3 - cuDNN 8200 - cudaNCCL (2, 10, 3)
eudoxus:4055975:4055975 [0] NCCL INFO Bootstrap : Using enp195s0f0:192.168.200.110<0>
eudoxus:4055975:4055975 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
eudoxus:4055975:4055975 [0] NCCL INFO NET/IB : No device found.
eudoxus:4055975:4055975 [0] NCCL INFO NET/Socket : Using [0]enp195s0f0:192.168.200.110<0> [1]eno1:192.168.200.104<0> [2]virbr0:192.168.122.1<0> [3]veth0b22d9b:fe80::7049:e9ff:fe1d:1e92%veth0b22d9b<0>
eudoxus:4055975:4055975 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
eudoxus:4055977:4055977 [2] NCCL INFO Bootstrap : Using enp195s0f0:192.168.200.110<0>
eudoxus:4055976:4055976 [1] NCCL INFO Bootstrap : Using enp195s0f0:192.168.200.110<0>
eudoxus:4055977:4055977 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
eudoxus:4055976:4055976 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
eudoxus:4055976:4055976 [1] NCCL INFO NET/IB : No device found.
eudoxus:4055977:4055977 [2] NCCL INFO NET/IB : No device found.
eudoxus:4055977:4055977 [2] NCCL INFO NET/Socket : Using [0]enp195s0f0:192.168.200.110<0> [1]eno1:192.168.200.104<0> [2]virbr0:192.168.122.1<0> [3]veth0b22d9b:fe80::7049:e9ff:fe1d:1e92%veth0b22d9b<0>
eudoxus:4055977:4055977 [2] NCCL INFO Using network Socket
eudoxus:4055976:4055976 [1] NCCL INFO NET/Socket : Using [0]enp195s0f0:192.168.200.110<0> [1]eno1:192.168.200.104<0> [2]virbr0:192.168.122.1<0> [3]veth0b22d9b:fe80::7049:e9ff:fe1d:1e92%veth0b22d9b<0>
eudoxus:4055976:4055976 [1] NCCL INFO Using network Socket
eudoxus:4055978:4055978 [3] NCCL INFO Bootstrap : Using enp195s0f0:192.168.200.110<0>
eudoxus:4055978:4055978 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
eudoxus:4055978:4055978 [3] NCCL INFO NET/IB : No device found.
eudoxus:4055978:4055978 [3] NCCL INFO NET/Socket : Using [0]enp195s0f0:192.168.200.110<0> [1]eno1:192.168.200.104<0> [2]virbr0:192.168.122.1<0> [3]veth0b22d9b:fe80::7049:e9ff:fe1d:1e92%veth0b22d9b<0>
eudoxus:4055978:4055978 [3] NCCL INFO Using network Socket
eudoxus:4055975:4056038 [0] NCCL INFO Channel 00/04 : 0 1 2 3
eudoxus:4055976:4056040 [1] NCCL INFO Trees [0] 3/-1/-1->1->0 [1] 0/-1/-1->1->3 [2] 3/-1/-1->1->0 [3] 0/-1/-1->1->3
eudoxus:4055977:4056039 [2] NCCL INFO Trees [0] 0/-1/-1->2->-1 [1] -1/-1/-1->2->0 [2] 0/-1/-1->2->-1 [3] -1/-1/-1->2->0
eudoxus:4055975:4056038 [0] NCCL INFO Channel 01/04 : 0 1 2 3
eudoxus:4055975:4056038 [0] NCCL INFO Channel 02/04 : 0 1 2 3
eudoxus:4055976:4056040 [1] NCCL INFO Setting affinity for GPU 1 to ffff00,00000000,00ffff00
eudoxus:4055978:4056041 [3] NCCL INFO Trees [0] -1/-1/-1->3->1 [1] 1/-1/-1->3->-1 [2] -1/-1/-1->3->1 [3] 1/-1/-1->3->-1
eudoxus:4055977:4056039 [2] NCCL INFO Setting affinity for GPU 2 to ffff00,00000000,00ffff00
eudoxus:4055975:4056038 [0] NCCL INFO Channel 03/04 : 0 1 2 3
eudoxus:4055975:4056038 [0] NCCL INFO Trees [0] 1/-1/-1->0->2 [1] 2/-1/-1->0->1 [2] 1/-1/-1->0->2 [3] 2/-1/-1->0->1
eudoxus:4055978:4056041 [3] NCCL INFO Setting affinity for GPU 3 to ffff00,00000000,00ffff00,00000000
eudoxus:4055975:4056038 [0] NCCL INFO Setting affinity for GPU 0 to ffff00,00000000,00ffff00
eudoxus:4055976:4056040 [1] NCCL INFO Channel 00 : 1[25000] -> 2[41000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 00 : 2[41000] -> 3[c1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 00 : 0[1000] -> 1[25000] via P2P/IPC/read
eudoxus:4055978:4056041 [3] NCCL INFO Channel 00 : 3[c1000] -> 0[1000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 01 : 1[25000] -> 2[41000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 01 : 2[41000] -> 3[c1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 01 : 0[1000] -> 1[25000] via P2P/IPC/read
eudoxus:4055978:4056041 [3] NCCL INFO Channel 01 : 3[c1000] -> 0[1000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 02 : 1[25000] -> 2[41000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 02 : 2[41000] -> 3[c1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 02 : 0[1000] -> 1[25000] via P2P/IPC/read
eudoxus:4055978:4056041 [3] NCCL INFO Channel 02 : 3[c1000] -> 0[1000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 03 : 1[25000] -> 2[41000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 03 : 2[41000] -> 3[c1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 03 : 0[1000] -> 1[25000] via P2P/IPC/read
eudoxus:4055978:4056041 [3] NCCL INFO Channel 03 : 3[c1000] -> 0[1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Connected all rings
eudoxus:4055976:4056040 [1] NCCL INFO Connected all rings
eudoxus:4055977:4056039 [2] NCCL INFO Connected all rings
eudoxus:4055978:4056041 [3] NCCL INFO Connected all rings
eudoxus:4055975:4056038 [0] NCCL INFO Channel 00 : 0[1000] -> 2[41000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 00 : 1[25000] -> 3[c1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 01 : 0[1000] -> 2[41000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 01 : 1[25000] -> 3[c1000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 00 : 2[41000] -> 0[1000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 02 : 0[1000] -> 2[41000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 02 : 1[25000] -> 3[c1000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 01 : 2[41000] -> 0[1000] via P2P/IPC
eudoxus:4055978:4056041 [3] NCCL INFO Channel 00 : 3[c1000] -> 1[25000] via P2P/IPC
eudoxus:4055975:4056038 [0] NCCL INFO Channel 03 : 0[1000] -> 2[41000] via P2P/IPC
eudoxus:4055976:4056040 [1] NCCL INFO Channel 03 : 1[25000] -> 3[c1000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 02 : 2[41000] -> 0[1000] via P2P/IPC
eudoxus:4055978:4056041 [3] NCCL INFO Channel 01 : 3[c1000] -> 1[25000] via P2P/IPC
eudoxus:4055977:4056039 [2] NCCL INFO Channel 03 : 2[41000] -> 0[1000] via P2P/IPC
eudoxus:4055978:4056041 [3] NCCL INFO Channel 02 : 3[c1000] -> 1[25000] via P2P/IPC
eudoxus:4055978:4056041 [3] NCCL INFO Channel 03 : 3[c1000] -> 1[25000] via P2P/IPC
eudoxus:4055978:4056041 [3] NCCL INFO Connected all trees
eudoxus:4055978:4056041 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
eudoxus:4055978:4056041 [3] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
eudoxus:4055977:4056039 [2] NCCL INFO Connected all trees
eudoxus:4055977:4056039 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
eudoxus:4055977:4056039 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
eudoxus:4055976:4056040 [1] NCCL INFO Channel 00 : 1[25000] -> 0[1000] via P2P/IPC/read
eudoxus:4055976:4056040 [1] NCCL INFO Channel 01 : 1[25000] -> 0[1000] via P2P/IPC/read
eudoxus:4055976:4056040 [1] NCCL INFO Channel 02 : 1[25000] -> 0[1000] via P2P/IPC/read
eudoxus:4055976:4056040 [1] NCCL INFO Channel 03 : 1[25000] -> 0[1000] via P2P/IPC/read
eudoxus:4055975:4056038 [0] NCCL INFO Connected all trees
eudoxus:4055975:4056038 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
eudoxus:4055975:4056038 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
eudoxus:4055976:4056040 [1] NCCL INFO Connected all trees
eudoxus:4055976:4056040 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
eudoxus:4055976:4056040 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
eudoxus:4055977:4056039 [2] NCCL INFO comm 0x7fe4f8002fb0 rank 2 nranks 4 cudaDev 2 busId 41000 - Init COMPLETE
eudoxus:4055975:4056038 [0] NCCL INFO comm 0x7f4250002fb0 rank 0 nranks 4 cudaDev 0 busId 1000 - Init COMPLETE
eudoxus:4055978:4056041 [3] NCCL INFO comm 0x7f56a4002fb0 rank 3 nranks 4 cudaDev 3 busId c1000 - Init COMPLETE
eudoxus:4055976:4056040 [1] NCCL INFO comm 0x7f4400002fb0 rank 1 nranks 4 cudaDev 1 busId 25000 - Init COMPLETE
eudoxus:4055975:4055975 [0] NCCL INFO Launch mode Parallel
Note that the "Global rank ..." print that should follow dist.barrier() never appears, so all four workers seem to be stuck inside the barrier at 100% GPU utilization. Output of nvidia-smi after the run:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... Off | 00000000:01:00.0 Off | 0 |
| N/A 40C P0 84W / 300W | 2225MiB / 80994MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... Off | 00000000:25:00.0 Off | 0 |
| N/A 38C P0 85W / 300W | 2191MiB / 80994MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80G... Off | 00000000:41:00.0 Off | 0 |
| N/A 40C P0 80W / 300W | 2185MiB / 80994MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80G... Off | 00000000:C1:00.0 Off | 0 |
| N/A 39C P0 84W / 300W | 2185MiB / 80994MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1923796 C ...3/envs/direct/bin/python3 2223MiB |
| 1 N/A N/A 1923797 C ...3/envs/direct/bin/python3 2189MiB |
| 2 N/A N/A 1923798 C ...3/envs/direct/bin/python3 2183MiB |
| 3 N/A N/A 1923799 C ...3/envs/direct/bin/python3 2183MiB |
+-----------------------------------------------------------------------------+
I then need to kill all processes manually.
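As a stopgap against manual cleanup, enabling PyTorch's NCCL async error handling should (assuming the watchdog behaves as documented for this PyTorch version) turn the hang into an error once the 10-second process-group timeout elapses:

# Assumption: with NCCL_ASYNC_ERROR_HANDLING=1, ProcessGroupNCCL's watchdog
# aborts collectives that exceed the `timeout` passed to init_process_group
# and raises an exception in the worker instead of blocking forever.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # must be set before init_process_group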
Versions
Collecting environment information...
PyTorch version: 1.11.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31
Python version: 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-107-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
GPU 2: NVIDIA A100 80GB PCIe
GPU 3: NVIDIA A100 80GB PCIe
Nvidia driver version: 470.103.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] numpydoc==1.2.1
[pip3] torch==1.11.0+cu113
[pip3] torchaudio==0.11.0+cu113
[pip3] torchvision==0.12.0+cu113
[conda] numpy 1.22.3 pypi_0 pypi
[conda] numpydoc 1.2.1 pypi_0 pypi
[conda] torch 1.11.0+cu113 pypi_0 pypi
[conda] torchaudio 0.11.0+cu113 pypi_0 pypi
[conda] torchvision 0.12.0+cu113 pypi_0 pypi