NCCL error right after init_process_group in barrier()

I tried to run this simple code in a 2-GPU environment and got an NCCL RuntimeError. I'd appreciate your help!

import torch
import os
import torch.distributed as dist

if __name__ == '__main__':
    print(f"RANK: {os.environ.get('RANK')} | WORLD_SIZE: {os.environ.get('WORLD_SIZE')} | LOCAL_RANK: {os.environ.get('LOCAL_RANK')}")
    print(f"MASTER_ADDR: {os.environ.get('MASTER_ADDR')} | MASTER_PORT: {os.environ.get('MASTER_PORT')}")
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group(backend='nccl', init_method='env://', rank=rank, world_size=world_size)
    dist.barrier()
$ NCCL_DEBUG=INFO torchrun --nnodes=1 --nproc_per_node=2 --master_port=29520 try_distributed.py 

RANK: 0 | WORLD_SIZE: 2 | LOCAL_RANK: 0
MASTER_ADDR: 127.0.0.1 | MASTER_PORT: 29520
RANK: 1 | WORLD_SIZE: 2 | LOCAL_RANK: 1
MASTER_ADDR: 127.0.0.1 | MASTER_PORT: 29520
mlgpu5:847:847 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
mlgpu5:847:847 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

mlgpu5:847:847 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
mlgpu5:847:847 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
mlgpu5:847:847 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
mlgpu5:848:848 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
mlgpu5:848:848 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

mlgpu5:848:848 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
mlgpu5:848:848 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
mlgpu5:848:848 [0] NCCL INFO Using network Socket

mlgpu5:848:863 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 15000

mlgpu5:847:862 [0] init.cc:521 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 15000
mlgpu5:848:863 [0] NCCL INFO init.cc:904 -> 5
mlgpu5:847:862 [0] NCCL INFO init.cc:904 -> 5
mlgpu5:848:863 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
mlgpu5:847:862 [0] NCCL INFO group.cc:72 -> 5 [Async thread]
Traceback (most recent call last):
  File "/repo/explore/try_distributed.py", line 11, in <module>
Traceback (most recent call last):
  File "/repo/explore/try_distributed.py", line 11, in <module>
    dist.barrier()
  File "/opt/miniconda3/envs/torch_v2/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    dist.barrier()
  File "/opt/miniconda3/envs/torch_v2/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1634272068185/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1634272068185/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

I found the problem. Apparently one of the GPUs was not visible due to a settings issue, so both ranks ended up on the same CUDA device. Once I resolved that, it works. So the "Duplicate GPU detected" warning is the part that needs attention, not the ncclInvalidUsage message.
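For anyone else who hits the same warning: below is a minimal sketch of the usual single-node setup, assuming the script is launched with torchrun as above. The two things it adds to the original repro are a check of torch.cuda.device_count() (to confirm every GPU is actually visible, e.g. not hidden by CUDA_VISIBLE_DEVICES) and a torch.cuda.set_device(LOCAL_RANK) call before the first collective, so each rank uses its own GPU instead of both defaulting to cuda:0.

import os
import torch
import torch.distributed as dist

if __name__ == '__main__':
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # Sanity check: each process should see all GPUs on the node.
    # If CUDA_VISIBLE_DEVICES hides one, device_count() will show it.
    print(f"rank {rank}: visible GPUs = {torch.cuda.device_count()}")

    # Pin this process to its own GPU before the first collective,
    # otherwise both ranks use cuda:0 and NCCL reports
    # "Duplicate GPU detected".
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend='nccl', init_method='env://',
                            rank=rank, world_size=world_size)
    dist.barrier()
    dist.destroy_process_group()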
