torch.distributed.DistBackendError: NCCL error

Hello! When running a DeepSpeed training job, I get this error: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331

The job works fine on a single node but gives this error in a multi-node setup.

Any suggestions are appreciated.

Setup
2x NDv2 VMs, 8x V100 per VM.
ibX InfiniBand IP interface on both nodes. As the logs below show, NCCL detects this interface correctly, so NCCL_SOCKET_IFNAME is not set.

Software (identical on both nodes)
torch.__version__: 2.1.0+cu121
torch.cuda.nccl.version(): (2, 18, 1)
libncclVersion: 2.19.3-1+cuda12.3

Env variables
NCCL_DEBUG=INFO
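
For reference, a minimal sketch of how these settings could be built in Python before the process group is created. `NCCL_DEBUG=INFO` is the setting from this post; `NCCL_DEBUG_SUBSYS=INIT,NET` and the `"ib"` interface prefix are assumptions added for illustration (the prefix is guessed from the `ibP257s...` interface names in the log), and the helper names are hypothetical. How the variables reach every rank depends on the launcher.

```python
import os


def nccl_debug_env(ifname_prefix=None):
    """Return NCCL-related environment settings as a dict.

    ifname_prefix is optional. "ib" would be an assumption based on the
    interface names in the log (ibP257s...), not a required setting.
    """
    env = {
        "NCCL_DEBUG": "INFO",             # the setting used in this post
        "NCCL_DEBUG_SUBSYS": "INIT,NET",  # assumption: limit output to init and network
    }
    if ifname_prefix:
        # NCCL selects interfaces whose names start with this prefix.
        env["NCCL_SOCKET_IFNAME"] = ifname_prefix
    return env


def apply_env(env):
    """Set the variables in this process without overriding existing values."""
    for key, value in env.items():
        os.environ.setdefault(key, value)
```

Calling `apply_env(nccl_debug_env())` leaves the interface selection to NCCL, which matches the setup described above; passing `"ib"` would pin it explicitly.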

Log


MLVM: MLVM:12343:12343 [0] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12343:12343 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM: MLVM:12343:12343 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM: MLVM:12343:12343 [0] NCCL INFO cudaDriverVersion 12030

MLVM: NCCL version 2.18.1+cuda12.1

MLVM: MLVM:12353:12353 [7] NCCL INFO cudaDriverVersion 12030

MLVM: MLVM:12353:12353 [7] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12347:12347 [4] NCCL INFO cudaDriverVersion 12030

MLVM: MLVM:12353:12353 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM: MLVM:12353:12353 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM: MLVM:12347:12347 [4] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12347:12347 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM: MLVM:12347:12347 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM2: MLVM2:10793:10793 [1] NCCL INFO cudaDriverVersion 12030

MLVM2: MLVM2:10799:10799 [6] NCCL INFO cudaDriverVersion 12030

MLVM2: MLVM2:10793:10793 [1] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10799:10799 [6] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10799:10799 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM2: MLVM2:10799:10799 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM2: MLVM2:10793:10793 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM2: MLVM2:10793:10793 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM: MLVM:12350:12350 [6] NCCL INFO cudaDriverVersion 12030

MLVM: MLVM:12350:12350 [6] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>

MLVM2: MLVM2:10796:10796 [4] NCCL INFO cudaDriverVersion 12030

MLVM: MLVM:12350:12350 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM: MLVM:12350:12350 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM2: MLVM2:10796:10796 [4] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10796:10796 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM2: MLVM2:10796:10796 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM: MLVM:12346:12346 [3] NCCL INFO cudaDriverVersion 12030

MLVM2: MLVM2:10797:10797 [5] NCCL INFO cudaDriverVersion 12030

MLVM: MLVM:12346:12346 [3] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>

MLVM2: MLVM2:10797:10797 [5] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>

MLVM: MLVM:12348:12348 [5] NCCL INFO cudaDriverVersion 12030

MLVM: MLVM:12346:12346 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM: MLVM:12346:12346 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM2: MLVM2:10797:10797 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM2: MLVM2:10797:10797 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM: MLVM:12348:12348 [5] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12348:12348 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM: MLVM:12348:12348 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM: MLVM:12344:12344 [1] NCCL INFO cudaDriverVersion 12030

MLVM: MLVM:12344:12344 [1] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12344:12344 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM: MLVM:12344:12344 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM2: MLVM2:10802:10802 [7] NCCL INFO cudaDriverVersion 12030

MLVM2: MLVM2:10802:10802 [7] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10802:10802 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM2: MLVM2:10802:10802 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM2: MLVM2:10792:10792 [0] NCCL INFO cudaDriverVersion 12030

MLVM2: MLVM2:10792:10792 [0] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10792:10792 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM2: MLVM2:10792:10792 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM: MLVM:12345:12345 [2] NCCL INFO cudaDriverVersion 12030

MLVM: MLVM:12345:12345 [2] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12345:12345 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM: MLVM:12345:12345 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM2: MLVM2:10795:10795 [3] NCCL INFO cudaDriverVersion 12030

MLVM2: MLVM2:10795:10795 [3] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10795:10795 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM2: MLVM2:10795:10795 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM2: MLVM2:10794:10794 [2] NCCL INFO cudaDriverVersion 12030

MLVM2: MLVM2:10794:10794 [2] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10794:10794 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory

MLVM2: MLVM2:10794:10794 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation

MLVM: MLVM:12343:13006 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12343:13006 [0] NCCL INFO Using network IB

MLVM: MLVM:12348:13009 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12348:13009 [5] NCCL INFO Using network IB

MLVM: MLVM:12353:13008 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12353:13008 [7] NCCL INFO Using network IB

MLVM: MLVM:12347:13012 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12347:13012 [4] NCCL INFO Using network IB

MLVM: MLVM:12346:13010 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12346:13010 [3] NCCL INFO Using network IB

MLVM: MLVM:12344:13011 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12344:13011 [1] NCCL INFO Using network IB

MLVM2: MLVM2:10799:11392 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10799:11392 [6] NCCL INFO Using network IB

MLVM: MLVM:12350:13007 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12350:13007 [6] NCCL INFO Using network IB

MLVM2: MLVM2:10796:11396 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10796:11396 [4] NCCL INFO Using network IB

MLVM2: MLVM2:10802:11393 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10802:11393 [7] NCCL INFO Using network IB

MLVM2: MLVM2:10797:11395 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10797:11395 [5] NCCL INFO Using network IB

MLVM: MLVM:12345:13013 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>

MLVM: MLVM:12345:13013 [2] NCCL INFO Using network IB

MLVM2: MLVM2:10795:11398 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10795:11398 [3] NCCL INFO Using network IB

MLVM2: MLVM2:10794:11399 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10794:11399 [2] NCCL INFO Using network IB

MLVM2: MLVM2:10792:11397 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10792:11397 [0] NCCL INFO Using network IB

MLVM2: MLVM2:10793:11394 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>

MLVM2: MLVM2:10793:11394 [1] NCCL INFO Using network IB

MLVM2: MLVM2:10796:11396 [4] NCCL INFO misc/socket.cc:564 -> 2

MLVM2: MLVM2:10802:11393 [7] NCCL INFO misc/socket.cc:564 -> 2

MLVM2: MLVM2:10797:11395 [5] NCCL INFO misc/socket.cc:564 -> 2

MLVM2: MLVM2:10802:11393 [7] NCCL INFO misc/socket.cc:615 -> 2

MLVM2: MLVM2:10796:11396 [4] NCCL INFO misc/socket.cc:615 -> 2

MLVM2: MLVM2:10797:11395 [5] NCCL INFO misc/socket.cc:615 -> 2

MLVM2: MLVM2:10802:11393 [7] NCCL INFO bootstrap.cc:270 -> 2

MLVM2: MLVM2:10796:11396 [4] NCCL INFO bootstrap.cc:270 -> 2

MLVM2: MLVM2:10797:11395 [5] NCCL INFO bootstrap.cc:270 -> 2

MLVM2: MLVM2:10802:11393 [7] NCCL INFO init.cc:1303 -> 2

MLVM2: MLVM2:10796:11396 [4] NCCL INFO init.cc:1303 -> 2

MLVM2: MLVM2:10797:11395 [5] NCCL INFO init.cc:1303 -> 2

MLVM2: MLVM2:10802:11393 [7] NCCL INFO group.cc:64 -> 2 [Async thread]

MLVM2: MLVM2:10796:11396 [4] NCCL INFO group.cc:64 -> 2 [Async thread]

MLVM2: MLVM2:10797:11395 [5] NCCL INFO group.cc:64 -> 2 [Async thread]

MLVM2: MLVM2:10795:11398 [3] NCCL INFO misc/socket.cc:564 -> 2

MLVM2: MLVM2:10795:11398 [3] NCCL INFO misc/socket.cc:615 -> 2

MLVM2: MLVM2:10795:11398 [3] NCCL INFO bootstrap.cc:270 -> 2

MLVM2: MLVM2:10793:11394 [1] NCCL INFO misc/socket.cc:564 -> 2

MLVM2: MLVM2:10795:11398 [3] NCCL INFO init.cc:1303 -> 2

MLVM2: MLVM2:10793:11394 [1] NCCL INFO misc/socket.cc:615 -> 2

MLVM2: MLVM2:10795:11398 [3] NCCL INFO group.cc:64 -> 2 [Async thread]

MLVM2: MLVM2:10793:11394 [1] NCCL INFO bootstrap.cc:270 -> 2

MLVM2: MLVM2:10799:11392 [6] NCCL INFO misc/socket.cc:564 -> 2

MLVM2: MLVM2:10793:11394 [1] NCCL INFO init.cc:1303 -> 2

MLVM2: MLVM2:10799:11392 [6] NCCL INFO misc/socket.cc:615 -> 2

MLVM2: MLVM2:10793:11394 [1] NCCL INFO group.cc:64 -> 2 [Async thread]

MLVM2: MLVM2:10799:11392 [6] NCCL INFO bootstrap.cc:270 -> 2

MLVM2: MLVM2:10799:11392 [6] NCCL INFO init.cc:1303 -> 2

MLVM2: MLVM2:10799:11392 [6] NCCL INFO group.cc:64 -> 2 [Async thread]

MLVM2: MLVM2:10792:11397 [0] NCCL INFO misc/socket.cc:564 -> 2

MLVM2: MLVM2:10792:11397 [0] NCCL INFO misc/socket.cc:615 -> 2

MLVM2: MLVM2:10792:11397 [0] NCCL INFO bootstrap.cc:270 -> 2

MLVM2: MLVM2:10792:11397 [0] NCCL INFO init.cc:1303 -> 2

MLVM2: MLVM2:10794:11399 [2] NCCL INFO misc/socket.cc:564 -> 2

MLVM2: MLVM2:10792:11397 [0] NCCL INFO group.cc:64 -> 2 [Async thread]

MLVM2: MLVM2:10794:11399 [2] NCCL INFO misc/socket.cc:615 -> 2

MLVM2: MLVM2:10794:11399 [2] NCCL INFO bootstrap.cc:270 -> 2

MLVM2: MLVM2:10794:11399 [2] NCCL INFO init.cc:1303 -> 2

MLVM2: MLVM2:10794:11399 [2] NCCL INFO group.cc:64 -> 2 [Async thread]

MLVM2: MLVM2:10797:10797 [5] NCCL INFO group.cc:422 -> 2

MLVM2: MLVM2:10795:10795 [3] NCCL INFO group.cc:422 -> 2

MLVM2: MLVM2:10797:10797 [5] NCCL INFO group.cc:106 -> 2

MLVM2: MLVM2:10795:10795 [3] NCCL INFO group.cc:106 -> 2

MLVM2: MLVM2:10796:10796 [4] NCCL INFO group.cc:422 -> 2

MLVM2: MLVM2:10802:10802 [7] NCCL INFO group.cc:422 -> 2

MLVM2: MLVM2:10796:10796 [4] NCCL INFO group.cc:106 -> 2

MLVM2: MLVM2:10802:10802 [7] NCCL INFO group.cc:106 -> 2

MLVM2: MLVM2:10794:10794 [2] NCCL INFO group.cc:422 -> 2

MLVM2: MLVM2:10793:10793 [1] NCCL INFO group.cc:422 -> 2

MLVM2: MLVM2:10794:10794 [2] NCCL INFO group.cc:106 -> 2

MLVM2: MLVM2:10793:10793 [1] NCCL INFO group.cc:106 -> 2

MLVM2: MLVM2:10799:10799 [6] NCCL INFO group.cc:422 -> 2

MLVM2: MLVM2:10799:10799 [6] NCCL INFO group.cc:106 -> 2

MLVM2: MLVM2:10792:10792 [0] NCCL INFO group.cc:422 -> 2

MLVM2: MLVM2:10792:10792 [0] NCCL INFO group.cc:106 -> 2

MLVM2: Traceback (most recent call last):

MLVM2: Traceback (most recent call last):

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>

MLVM2: Traceback (most recent call last):

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>

MLVM2: Traceback (most recent call last):

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>

MLVM2: Traceback (most recent call last):

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>

MLVM2: Traceback (most recent call last):

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>

MLVM2: pretrain(train_valid_test_datasets_provider,

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain

MLVM2: pretrain(train_valid_test_datasets_provider,

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain

MLVM2: initialize_megatron(extra_args_provider=extra_args_provider,

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron

MLVM2: initialize_megatron(extra_args_provider=extra_args_provider,

MLVM2: pretrain(train_valid_test_datasets_provider, File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron

MLVM2:

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain

MLVM2: _compile_dependencies()

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies

MLVM2: _compile_dependencies()

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies

MLVM2: initialize_megatron(extra_args_provider=extra_args_provider,pretrain(train_valid_test_datasets_provider,torch.distributed.barrier()

MLVM2:

MLVM2:

MLVM2: pretrain(train_valid_test_datasets_provider, File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron

MLVM2: torch.distributed.barrier()

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper

MLVM2:

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper

MLVM2: return func(*args, **kwargs)_compile_dependencies()return func(*args, **kwargs)

MLVM2:

MLVM2:

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier

MLVM2: initialize_megatron(extra_args_provider=extra_args_provider,

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies

MLVM2: initialize_megatron(extra_args_provider=extra_args_provider,

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron

MLVM2: _compile_dependencies()torch.distributed.barrier()

MLVM2:

MLVM2: _compile_dependencies()

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies

MLVM2: pretrain(train_valid_test_datasets_provider,

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain

MLVM2: return func(*args, **kwargs)

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier

MLVM2: torch.distributed.barrier()

MLVM2: torch.distributed.barrier()

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper

MLVM2: initialize_megatron(extra_args_provider=extra_args_provider,

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron

MLVM2: return func(*args, **kwargs)

MLVM2: return func(*args, **kwargs)

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier

MLVM2: _compile_dependencies()

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies

MLVM2: Traceback (most recent call last):

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>

MLVM2: torch.distributed.barrier()

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper

MLVM2: return func(*args, **kwargs)

MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier

MLVM2: work = default_pg.barrier(opts=opts)

MLVM2: work = default_pg.barrier(opts=opts)

MLVM2: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1

MLVM2: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.

MLVM2: Last error:

MLVM2:

MLVM2: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1

MLVM2: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.

MLVM2: Last error:
MLVM2: _compile_dependencies()torch.distributed.barrier()
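
Since the interleaved output is hard to read, here is a small sketch that pulls the `file:line -> code` frames out of an NCCL_DEBUG=INFO log and groups them per host and GPU index. The regular expression is written against the line format shown above; the function name is hypothetical and other NCCL versions may format these lines differently.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   MLVM2: MLVM2:10796:11396 [4] NCCL INFO misc/socket.cc:564 -> 2
FRAME = re.compile(
    r"(?P<host>\S+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<gpu>\d+)\] NCCL INFO "
    r"(?P<src>[\w/.]+):(?P<line>\d+) -> (?P<code>\d+)"
)


def failing_frames(log_text):
    """Group NCCL error-return frames by (host, GPU index), in log order."""
    frames = defaultdict(list)
    for line in log_text.splitlines():
        match = FRAME.search(line)
        if match:
            key = (match["host"], int(match["gpu"]))
            frames[key].append(f'{match["src"]}:{match["line"]}')
    return dict(frames)
```

Run over the excerpt above, this would make it easier to see that the frames come from MLVM2 and follow the same path on each GPU: misc/socket.cc, then bootstrap.cc, init.cc and group.cc.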

What is _compile_dependencies() doing and could it be failing?

Thanks @ptrblck for the quick response! _compile_dependencies() is defined in megatron/initialize.py in Megatron-DeepSpeed.

At first glance, all it does is load precompiled fused kernels on rank 0 and impose a barrier until that’s done. The stack trace terminates at torch.distributed.barrier(), so I assume that is where the issue stems from.
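
The NCCL frames before the traceback (misc/socket.cc, bootstrap.cc, init.cc) may point to a failure while the ranks connect to each other during communicator setup, which would surface at the first collective such as this barrier. One way to rule out basic reachability is a plain TCP check between the nodes, sketched below with the standard library only. This is not part of the original job: the function name is hypothetical, and the port must be one where a listener is actually running on the other node.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

For example, with a listener started on one node, `can_connect("172.16.1.94", port)` could be run from the other node, using the bootstrap address reported in the log. A `False` result would suggest looking at firewall rules or routing on that interface before digging further into NCCL itself.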

Could you check if the dependency loading is executed successfully e.g. just by printing a debug statement after its execution?

I added debug statements (highlighted in green on the left), and the logs below show that the error comes from the torch.distributed.barrier() calls highlighted with red arrows.

That is, rank 0 loads the kernels and calls the barrier but never reaches the next "... passed barrier" debug statement (likewise for the other ranks).
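
Roughly, the instrumentation follows the pattern sketched below. This is a simplified illustration, not the actual Megatron-DeepSpeed code: `load_kernels` and `barrier` are injected callables with hypothetical names (in the real job the barrier is torch.distributed.barrier), and the exact wording of the debug messages other than "done loading fused kernels" is assumed.

```python
def _log(message):
    # flush=True so the message is not lost in a buffer if the process dies.
    print(message, flush=True)


def load_then_sync(rank, load_kernels, barrier, log=_log):
    """Rank 0 loads the kernels first; every rank then waits at the barrier.

    load_kernels and barrier are injected callables (hypothetical names);
    in the real job the barrier would be torch.distributed.barrier.
    """
    if rank == 0:
        load_kernels()
        log(f"> Rank_{rank} done loading fused kernels!")
    log(f"> Rank_{rank} entering barrier")
    barrier()
    # Only reached if the barrier returns without raising.
    log(f"> Rank_{rank} passed barrier")
```

If a rank prints "entering barrier" but never "passed barrier", the failure is inside the barrier call itself rather than in the kernel loading.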

MLVM: > Rank_0 done loading fused kernels!
MLVM: MLVM:54184:54184 [0] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>
MLVM: MLVM:54184:54184 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:54184:54184 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:54184:54184 [0] NCCL INFO cudaDriverVersion 12030
MLVM: NCCL version 2.18.1+cuda12.1
MLVM: MLVM:54186:54186 [2] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:54186:54186 [2] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>
MLVM: MLVM:54185:54185 [1] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:54186:54186 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:54186:54186 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:54185:54185 [1] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>
MLVM2: MLVM2:12152:12152 [7] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:54185:54185 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:54185:54185 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:12152:12152 [7] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:12152:12152 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:12152:12152 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:54194:54194 [7] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:54194:54194 [7] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>
MLVM: MLVM:54194:54194 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:54194:54194 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
.
.
.
MLVM2: Traceback (most recent call last):
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2: Traceback (most recent call last):
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2: Traceback (most recent call last):
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2:         pretrain(train_valid_test_datasets_provider,pretrain(train_valid_test_datasets_provider,
MLVM2: 
MLVM2: Traceback (most recent call last):
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2:     pretrain(train_valid_test_datasets_provider,
MLVM2:     initialize_megatron(extra_args_provider=extra_args_provider,  File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2: 
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2:     initialize_megatron(extra_args_provider=extra_args_provider,
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2:     _compile_dependencies()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2:     initialize_megatron(extra_args_provider=extra_args_provider,
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2:     _compile_dependencies()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2:     torch.distributed.barrier()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2:     _compile_dependencies()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2:     torch.distributed.barrier()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2:     return func(*args, **kwargs)
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2:     torch.distributed.barrier()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2:     return func(*args, **kwargs)
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2:     return func(*args, **kwargs)
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2:     pretrain(train_valid_test_datasets_provider,
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2:     initialize_megatron(extra_args_provider=extra_args_provider,
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2:     _compile_dependencies()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2:     torch.distributed.barrier()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2:     return func(*args, **kwargs)
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2:     work = default_pg.barrier(opts=opts)
MLVM2: torch.distributed.DistBackendError:     NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
MLVM2: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
MLVM2: Last error:
MLVM2: work = default_pg.barrier(opts=opts)
MLVM2: 
MLVM2:     work = default_pg.barrier(opts=opts)
MLVM2: torch.distributed.DistBackendErrortorch.distributed: .NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
MLVM2: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
MLVM2: Last error:
MLVM2: DistBackendError
MLVM2: : NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
MLVM2: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
MLVM2: Last error:
MLVM2: 
MLVM2:     work = default_pg.barrier(opts=opts)
MLVM2: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
MLVM2: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
MLVM2: Last error:
MLVM2: 
MLVM2: Traceback (most recent call last):
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2: Traceback (most recent call last):
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2:     pretrain(train_valid_test_datasets_provider,
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2:     pretrain(train_valid_test_datasets_provider,    initialize_megatron(extra_args_provider=extra_args_provider,
MLVM2: 
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2:     _compile_dependencies()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2:     initialize_megatron(extra_args_provider=extra_args_provider,
MLVM2:       File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2: torch.distributed.barrier()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2:     return func(*args, **kwargs)
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2:     _compile_dependencies()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2:     torch.distributed.barrier()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2:     return func(*args, **kwargs)
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2: Traceback (most recent call last):
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2:     work = default_pg.barrier(opts=opts)
MLVM2: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
MLVM2: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
MLVM2: Last error:
MLVM2: 
MLVM2:     work = default_pg.barrier(opts=opts)
MLVM2: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
MLVM2: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
MLVM2: Last error:
MLVM2: 
MLVM2: Traceback (most recent call last):
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2:     pretrain(train_valid_test_datasets_provider,
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2:     initialize_megatron(extra_args_provider=extra_args_provider,
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2:     pretrain(train_valid_test_datasets_provider,
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2:     _compile_dependencies()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2:     initialize_megatron(extra_args_provider=extra_args_provider,
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2:     torch.distributed.barrier()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2:     _compile_dependencies()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2:     return func(*args, **kwargs)
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2:     torch.distributed.barrier()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2:     return func(*args, **kwargs)
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2:     work = default_pg.barrier(opts=opts)
MLVM2: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
MLVM2: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
MLVM2: Last error:
MLVM2: 
MLVM2:     work = default_pg.barrier(opts=opts)
MLVM2: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
MLVM2: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
MLVM2: Last error:

Thanks for the follow-up. It does seem to be failing there, but I don’t see any indication of why in the NCCL logs. Could you rerun the code with these additional environment variables?

TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=INFO TORCH_SHOW_CPP_STACKTRACES=1
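If exporting the variables through the launcher is awkward on a multi-node job, one option is to set them at the very top of the training script, before `torch` is imported, so that every rank picks them up. This is only a minimal sketch: the variable names and values come from the line above, while the helper name `apply_debug_env` is hypothetical.

```python
import os

# Debug settings requested in this thread (names and values copied from the post).
DEBUG_ENV = {
    "TORCH_CPP_LOG_LEVEL": "INFO",
    "TORCH_DISTRIBUTED_DEBUG": "INFO",
    "TORCH_SHOW_CPP_STACKTRACES": "1",
}


def apply_debug_env(env=None):
    """Copy the debug settings into `env` (os.environ by default) and return it.

    `apply_debug_env` is a hypothetical helper, not part of torch or DeepSpeed.
    """
    target = os.environ if env is None else env
    target.update(DEBUG_ENV)
    return target


# Dry run against a plain dict, leaving the real environment untouched.
preview = apply_debug_env({})
print(preview)
```

In the real script you would call `apply_debug_env()` with no argument as the first statement, ahead of `import torch`, since the settings are meant to be in place before the process group is created.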

Here are the logs. They are no different from the previous ones.

MLVM: > Rank_0 done loading fused kernels!
MLVM: MLVM:6109:6109 [0] NCCL INFO Bootstrap : Using ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6109:6109 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:6109:6109 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:6109:6109 [0] NCCL INFO cudaDriverVersion 12030
MLVM: NCCL version 2.18.1+cuda12.1
MLVM2: MLVM2:5000:5000 [5] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:5000:5000 [5] NCCL INFO Bootstrap : Using ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:5000:5000 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:5000:5000 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:5002:5002 [6] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:5002:5002 [6] NCCL INFO Bootstrap : Using ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:5002:5002 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:5002:5002 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:4995:4995 [0] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:4995:4995 [0] NCCL INFO Bootstrap : Using ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:4995:4995 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:4995:4995 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:6113:6113 [4] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:6113:6113 [4] NCCL INFO Bootstrap : Using ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6113:6113 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:6113:6113 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:6110:6110 [1] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:6110:6110 [1] NCCL INFO Bootstrap : Using ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6110:6110 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:6110:6110 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:6115:6115 [6] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:6115:6115 [6] NCCL INFO Bootstrap : Using ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6115:6115 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:6115:6115 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:6118:6118 [7] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:6118:6118 [7] NCCL INFO Bootstrap : Using ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6118:6118 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:6118:6118 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:4999:4999 [4] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:4996:4996 [1] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:4999:4999 [4] NCCL INFO Bootstrap : Using ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:4996:4996 [1] NCCL INFO Bootstrap : Using ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:4999:4999 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:4999:4999 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:4996:4996 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:4996:4996 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:6112:6112 [3] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:6112:6112 [3] NCCL INFO Bootstrap : Using ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6112:6112 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:6112:6112 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:6114:6114 [5] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:6114:6114 [5] NCCL INFO Bootstrap : Using ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6114:6114 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:6114:6114 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:5004:5004 [7] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:5004:5004 [7] NCCL INFO Bootstrap : Using ibP257s454731:172.16.1.71<0>
MLVM: MLVM:6111:6111 [2] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:5004:5004 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:5004:5004 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:6111:6111 [2] NCCL INFO Bootstrap : Using ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6111:6111 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:6111:6111 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:4998:4998 [3] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:4998:4998 [3] NCCL INFO Bootstrap : Using ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:4998:4998 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:4998:4998 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:4997:4997 [2] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:4997:4997 [2] NCCL INFO Bootstrap : Using ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:4997:4997 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:4997:4997 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:6109:6771 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6109:6771 [0] NCCL INFO Using network IB
MLVM: MLVM:6114:6778 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6114:6778 [5] NCCL INFO Using network IB
MLVM: MLVM:6113:6772 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6113:6772 [4] NCCL INFO Using network IB
MLVM2: MLVM2:5000:5596 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:5000:5596 [5] NCCL INFO Using network IB
MLVM2: MLVM2:4995:5597 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:4995:5597 [0] NCCL INFO Using network IB
MLVM2: MLVM2:4997:5602 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:4997:5602 [2] NCCL INFO Using network IB
MLVM: MLVM:6111:6777 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6111:6777 [2] NCCL INFO Using network IB
MLVM: MLVM:6118:6773 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6118:6773 [7] NCCL INFO Using network IB
MLVM2: MLVM2:5002:5595 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:5002:5595 [6] NCCL INFO Using network IB
MLVM2: MLVM2:4998:5601 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:4998:5601 [3] NCCL INFO Using network IB
MLVM: MLVM:6115:6774 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6115:6774 [6] NCCL INFO Using network IB
MLVM2: MLVM2:5004:5600 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:5004:5600 [7] NCCL INFO Using network IB
MLVM: MLVM:6112:6775 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6112:6775 [3] NCCL INFO Using network IB
MLVM2: MLVM2:4999:5598 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:4999:5598 [4] NCCL INFO Using network IB
MLVM2: MLVM2:4996:5599 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s454731:172.16.1.71<0>
MLVM2: MLVM2:4996:5599 [1] NCCL INFO Using network IB
MLVM: MLVM:6110:6776 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s474637:172.16.1.95<0>
MLVM: MLVM:6110:6776 [1] NCCL INFO Using network IB
MLVM2: MLVM2:4997:5602 [2] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:5004:5600 [7] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:4995:5597 [0] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:5004:5600 [7] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:4997:5602 [2] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:4995:5597 [0] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:5004:5600 [7] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:4997:5602 [2] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:4995:5597 [0] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:5004:5600 [7] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:4997:5602 [2] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:4995:5597 [0] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:5002:5595 [6] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:5004:5600 [7] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:4995:5597 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:4997:5602 [2] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:5002:5595 [6] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:4998:5601 [3] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:5002:5595 [6] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:4998:5601 [3] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:5002:5595 [6] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:4998:5601 [3] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:5002:5595 [6] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:4998:5601 [3] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:4998:5601 [3] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:5000:5596 [5] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:5000:5596 [5] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:5000:5596 [5] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:4996:5599 [1] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:5000:5596 [5] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:4996:5599 [1] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:5000:5596 [5] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:4996:5599 [1] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:4996:5599 [1] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:4996:5599 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:5004:5004 [7] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:4999:5598 [4] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:5004:5004 [7] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:4999:5598 [4] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:4995:4995 [0] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:4999:5598 [4] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:4995:4995 [0] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:4997:4997 [2] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:4998:4998 [3] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:4999:5598 [4] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:4997:4997 [2] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:4998:4998 [3] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:5002:5002 [6] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:5002:5002 [6] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:4999:5598 [4] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:4996:4996 [1] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:4996:4996 [1] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:5000:5000 [5] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:5000:5000 [5] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:4999:4999 [4] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:4999:4999 [4] NCCL INFO group.cc:106 -> 2
MLVM2: Traceback (most recent call last):
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2:     pretrain(train_valid_test_datasets_provider,
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2:     initialize_megatron(extra_args_provider=extra_args_provider,
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2:     _compile_dependencies()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2:     torch.distributed.barrier()
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2:     return func(*args, **kwargs)
MLVM2:   File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2:     work = default_pg.barrier(opts=opts)
MLVM2: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
MLVM2: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
MLVM2: Last error:
(The same traceback is printed by each of the 8 ranks on MLVM2.)
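
Because the `file:line -> code` lines from the different ranks are interleaved in the log above, it can help to regroup them per process before comparing them. The sketch below is only an illustration: the regular expression is written against the line format shown in this log (it may not match other NCCL versions), and the helper name `group_error_trace` is hypothetical.

```python
import re

# Matches lines such as:
#   MLVM2: MLVM2:4997:5602 [2] NCCL INFO misc/socket.cc:564 -> 2
# The pattern is an assumption based on the log format in this thread.
TRACE_RE = re.compile(
    r"^(?P<host>\S+): \S+?:(?P<pid>\d+):(?P<tid>\d+) \[(?P<gpu>\d+)\] "
    r"NCCL INFO (?P<site>[\w./]+:\d+) -> (?P<code>\d+)"
)


def group_error_trace(lines):
    """Group 'file:line -> code' entries by (host, pid, gpu), keeping log order."""
    chains = {}
    for line in lines:
        m = TRACE_RE.match(line)
        if m is None:
            continue  # not an error-propagation line
        key = (m.group("host"), int(m.group("pid")), int(m.group("gpu")))
        chains.setdefault(key, []).append((m.group("site"), int(m.group("code"))))
    return chains


# A few lines copied from the log above.
sample = [
    "MLVM2: MLVM2:4997:5602 [2] NCCL INFO misc/socket.cc:564 -> 2",
    "MLVM2: MLVM2:5004:5600 [7] NCCL INFO misc/socket.cc:564 -> 2",
    "MLVM2: MLVM2:5004:5600 [7] NCCL INFO misc/socket.cc:615 -> 2",
    "MLVM2: MLVM2:4997:5602 [2] NCCL INFO misc/socket.cc:615 -> 2",
    "MLVM2: MLVM2:5004:5600 [7] NCCL INFO bootstrap.cc:270 -> 2",
    "MLVM2: MLVM2:5004:5600 [7] NCCL INFO group.cc:64 -> 2 [Async thread]",
    "MLVM2: MLVM2:5004:5004 [7] NCCL INFO group.cc:422 -> 2",
    "MLVM2: MLVM2:5004:5600 [7] NCCL INFO Using network IB",
]

for key, chain in group_error_trace(sample).items():
    print(key, " | ".join(site for site, _ in chain))
```

Grouped this way, every rank on MLVM2 shows the same chain starting at `misc/socket.cc:564` and passing through `bootstrap.cc:270`, and the trailing code `2` is consistent with the `ncclSystemError` reported in the exception.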

I decided to experiment with the Ethernet network interface, and the logs are definitely different. Also, instead of failing, the process hangs. Notice that the last error is similar to this one from NCCL, but the solution there does not work in my case.

MLVM: > Rank_0 done loading fused kernels!
.
.
.
MLVM2: MLVM2:8628:9223 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.5<0>
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Using network IB
MLVM: MLVM:10661:11328 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.4<0>
MLVM: MLVM:10661:11328 [1] NCCL INFO Using network IB
MLVM: MLVM:10660:11323 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.4<0>
MLVM: MLVM:10660:11323 [0] NCCL INFO Using network IB
MLVM: MLVM:10664:11324 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.4<0>
MLVM: MLVM:10664:11324 [4] NCCL INFO Using network IB
MLVM: MLVM:10665:11325 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.4<0>
MLVM: MLVM:10665:11325 [5] NCCL INFO Using network IB
MLVM2: MLVM2:8623:9224 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.5<0>
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Using network IB
MLVM: MLVM:10663:11327 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.4<0>
MLVM: MLVM:10663:11327 [3] NCCL INFO Using network IB
MLVM2: MLVM2:8627:9227 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.5<0>
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Using network IB
MLVM: MLVM:10666:11329 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.4<0>
MLVM: MLVM:10666:11329 [6] NCCL INFO Using network IB
MLVM: MLVM:10662:11330 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.4<0>
MLVM: MLVM:10662:11330 [2] NCCL INFO Using network IB
MLVM: MLVM:10669:11326 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.4<0>
MLVM: MLVM:10669:11326 [7] NCCL INFO Using network IB
MLVM2: MLVM2:8630:9228 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.5<0>
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Using network IB
MLVM2: MLVM2:8624:9226 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.5<0>
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Using network IB
MLVM2: MLVM2:8633:9225 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.5<0>
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Using network IB
MLVM2: MLVM2:8626:9230 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.5<0>
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Using network IB
MLVM2: MLVM2:8625:9229 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.1.0.5<0>
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Using network IB
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff
MLVM2: MLVM2:8626:9230 [3] NCCL INFO NVLS multicast support is not available on dev 3
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff
MLVM2: MLVM2:8625:9229 [2] NCCL INFO NVLS multicast support is not available on dev 2
MLVM: MLVM:10669:11326 [7] NCCL INFO Setting affinity for GPU 7 to ff,fff00000
MLVM: MLVM:10669:11326 [7] NCCL INFO NVLS multicast support is not available on dev 7
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Setting affinity for GPU 5 to ff,fff00000
MLVM2: MLVM2:8628:9223 [5] NCCL INFO NVLS multicast support is not available on dev 5
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff
MLVM2: MLVM2:8624:9226 [1] NCCL INFO NVLS multicast support is not available on dev 1
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Setting affinity for GPU 7 to ff,fff00000
MLVM2: MLVM2:8633:9225 [7] NCCL INFO NVLS multicast support is not available on dev 7
MLVM: MLVM:10661:11328 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff
MLVM: MLVM:10660:11323 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff
MLVM: MLVM:10660:11323 [0] NCCL INFO NVLS multicast support is not available on dev 0
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Setting affinity for GPU 6 to ff,fff00000
MLVM: MLVM:10662:11330 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff
MLVM: MLVM:10662:11330 [2] NCCL INFO NVLS multicast support is not available on dev 2
MLVM: MLVM:10661:11328 [1] NCCL INFO NVLS multicast support is not available on dev 1
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Setting affinity for GPU 4 to ff,fff00000
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff
MLVM2: MLVM2:8623:9224 [0] NCCL INFO NVLS multicast support is not available on dev 0
MLVM: MLVM:10664:11324 [4] NCCL INFO Setting affinity for GPU 4 to ff,fff00000
MLVM: MLVM:10663:11327 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff
MLVM: MLVM:10663:11327 [3] NCCL INFO NVLS multicast support is not available on dev 3
MLVM2: MLVM2:8630:9228 [6] NCCL INFO NVLS multicast support is not available on dev 6
MLVM: MLVM:10666:11329 [6] NCCL INFO Setting affinity for GPU 6 to ff,fff00000
MLVM2: MLVM2:8627:9227 [4] NCCL INFO NVLS multicast support is not available on dev 4
MLVM: MLVM:10665:11325 [5] NCCL INFO Setting affinity for GPU 5 to ff,fff00000
MLVM: MLVM:10665:11325 [5] NCCL INFO NVLS multicast support is not available on dev 5
MLVM: MLVM:10664:11324 [4] NCCL INFO NVLS multicast support is not available on dev 4
MLVM: MLVM:10666:11329 [6] NCCL INFO NVLS multicast support is not available on dev 6
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Trees [0] 14/-1/-1->15->12 [1] 14/-1/-1->15->12
MLVM2: MLVM2:8633:9225 [7] NCCL INFO P2P Chunksize set to 131072
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Trees [0] 11/-1/-1->13->14 [1] 11/-1/-1->13->14
MLVM2: MLVM2:8628:9223 [5] NCCL INFO P2P Chunksize set to 131072
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Trees [0] 13/-1/-1->14->15 [1] 13/-1/-1->14->15
MLVM2: MLVM2:8630:9228 [6] NCCL INFO P2P Chunksize set to 131072
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Trees [0] 12/-1/-1->10->9 [1] 12/-1/-1->10->9
MLVM2: MLVM2:8625:9229 [2] NCCL INFO P2P Chunksize set to 131072
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Trees [0] 15/-1/-1->12->10 [1] 15/-1/-1->12->10
MLVM2: MLVM2:8627:9227 [4] NCCL INFO P2P Chunksize set to 131072
MLVM: MLVM:10663:11327 [3] NCCL INFO Trees [0] -1/-1/-1->3->5 [1] -1/-1/-1->3->5
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 00/02 :    0   1   2   4   7   6   5   3   8   9  10  12  15  14  13  11
MLVM: MLVM:10665:11325 [5] NCCL INFO Trees [0] 3/-1/-1->5->6 [1] 3/-1/-1->5->6
MLVM: MLVM:10665:11325 [5] NCCL INFO P2P Chunksize set to 131072
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Trees [0] 9/-1/-1->8->0 [1] 9/0/-1->8->-1
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] 10/-1/-1->9->8
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 01/02 :    0   1   2   4   7   6   5   3   8   9  10  12  15  14  13  11
MLVM: MLVM:10660:11323 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->8
MLVM: MLVM:10669:11326 [7] NCCL INFO Trees [0] 6/-1/-1->7->4 [1] 6/-1/-1->7->4
MLVM: MLVM:10662:11330 [2] NCCL INFO Trees [0] 4/-1/-1->2->1 [1] 4/-1/-1->2->1
MLVM: MLVM:10660:11323 [0] NCCL INFO P2P Chunksize set to 131072
MLVM: MLVM:10669:11326 [7] NCCL INFO P2P Chunksize set to 131072
MLVM: MLVM:10662:11330 [2] NCCL INFO P2P Chunksize set to 131072
MLVM: MLVM:10664:11324 [4] NCCL INFO Trees [0] 7/-1/-1->4->2 [1] 7/-1/-1->4->2
MLVM2: MLVM2:8623:9224 [0] NCCL INFO P2P Chunksize set to 131072
MLVM: MLVM:10666:11329 [6] NCCL INFO Trees [0] 5/-1/-1->6->7 [1] 5/-1/-1->6->7
MLVM: MLVM:10664:11324 [4] NCCL INFO P2P Chunksize set to 131072
MLVM: MLVM:10666:11329 [6] NCCL INFO P2P Chunksize set to 131072
MLVM: MLVM:10661:11328 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
MLVM: MLVM:10661:11328 [1] NCCL INFO P2P Chunksize set to 131072
MLVM2: MLVM2:8624:9226 [1] NCCL INFO P2P Chunksize set to 131072
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Trees [0] -1/-1/-1->11->13 [1] -1/-1/-1->11->13
MLVM2: MLVM2:8626:9230 [3] NCCL INFO P2P Chunksize set to 131072
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 00/0 : 0[100000] -> 1[200000] via P2P/IPC
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Channel 00/0 : 13[600000] -> 11[400000] via P2P/IPC
MLVM: MLVM:10665:11325 [5] NCCL INFO Channel 00/0 : 5[600000] -> 3[400000] via P2P/IPC
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 00/0 : 8[100000] -> 9[200000] via P2P/IPC
MLVM: MLVM:10665:11325 [5] NCCL INFO Channel 01/0 : 5[600000] -> 3[400000] via P2P/IPC
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 01/0 : 0[100000] -> 1[200000] via P2P/IPC
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 01/0 : 8[100000] -> 9[200000] via P2P/IPC
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Channel 01/0 : 13[600000] -> 11[400000] via P2P/IPC
MLVM: MLVM:10663:11327 [3] NCCL INFO Channel 00/0 : 3[400000] -> 8[100000] [send] via NET/IB/0
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Channel 00/0 : 11[400000] -> 0[100000] [send] via NET/IB/0
MLVM: MLVM:10661:11328 [1] NCCL INFO Channel 00/0 : 1[200000] -> 2[300000] via P2P/IPC
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Channel 00/0 : 9[200000] -> 10[300000] via P2P/IPC
MLVM: MLVM:10663:11327 [3] NCCL INFO Channel 01/0 : 3[400000] -> 8[100000] [send] via NET/IB/0
MLVM: MLVM:10666:11329 [6] NCCL INFO Channel 00/0 : 6[700000] -> 5[600000] via P2P/IPC
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Channel 00/0 : 14[700000] -> 13[600000] via P2P/IPC
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Channel 01/0 : 11[400000] -> 0[100000] [send] via NET/IB/0
MLVM: MLVM:10661:11328 [1] NCCL INFO Channel 01/0 : 1[200000] -> 2[300000] via P2P/IPC
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Channel 01/0 : 9[200000] -> 10[300000] via P2P/IPC
MLVM: MLVM:10666:11329 [6] NCCL INFO Channel 01/0 : 6[700000] -> 5[600000] via P2P/IPC
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Channel 01/0 : 14[700000] -> 13[600000] via P2P/IPC
MLVM: MLVM:10662:11330 [2] NCCL INFO Channel 00/0 : 2[300000] -> 4[500000] via P2P/IPC
MLVM: MLVM:10661:11328 [1] NCCL INFO Connected all rings
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Connected all rings
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Channel 00/0 : 10[300000] -> 12[500000] via P2P/IPC
MLVM: MLVM:10662:11330 [2] NCCL INFO Channel 01/0 : 2[300000] -> 4[500000] via P2P/IPC
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Channel 01/0 : 10[300000] -> 12[500000] via P2P/IPC
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 00/0 : 11[400000] -> 0[100000] [receive] via NET/IB/0
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 00/0 : 3[400000] -> 8[100000] [receive] via NET/IB/0
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 01/0 : 3[400000] -> 8[100000] [receive] via NET/IB/0
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 01/0 : 11[400000] -> 0[100000] [receive] via NET/IB/0
MLVM: MLVM:10662:11330 [2] NCCL INFO Connected all rings
MLVM: MLVM:10664:11324 [4] NCCL INFO Channel 00/0 : 4[500000] -> 7[800000] via P2P/IPC
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Channel 00/0 : 12[500000] -> 15[800000] via P2P/IPC
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Connected all rings
MLVM: MLVM:10661:11328 [1] NCCL INFO Channel 00/0 : 1[200000] -> 0[100000] via P2P/IPC
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Channel 00/0 : 9[200000] -> 8[100000] via P2P/IPC
MLVM: MLVM:10664:11324 [4] NCCL INFO Channel 01/0 : 4[500000] -> 7[800000] via P2P/IPC
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Channel 01/0 : 12[500000] -> 15[800000] via P2P/IPC
MLVM: MLVM:10661:11328 [1] NCCL INFO Channel 01/0 : 1[200000] -> 0[100000] via P2P/IPC
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Channel 01/0 : 9[200000] -> 8[100000] via P2P/IPC
MLVM: MLVM:10669:11326 [7] NCCL INFO Channel 00/0 : 7[800000] -> 6[700000] via P2P/IPC
MLVM: MLVM:10664:11324 [4] NCCL INFO Connected all rings
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Channel 00/0 : 15[800000] -> 14[700000] via P2P/IPC
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Connected all rings
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Connected all rings
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Channel 00/0 : 11[400000] -> 13[600000] via P2P/IPC
MLVM: MLVM:10663:11327 [3] NCCL INFO Connected all rings
MLVM: MLVM:10663:11327 [3] NCCL INFO Channel 00/0 : 3[400000] -> 5[600000] via P2P/IPC
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Connected all rings
MLVM: MLVM:10660:11323 [0] NCCL INFO Connected all rings
MLVM: MLVM:10669:11326 [7] NCCL INFO Channel 01/0 : 7[800000] -> 6[700000] via P2P/IPC
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Channel 01/0 : 15[800000] -> 14[700000] via P2P/IPC
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Channel 01/0 : 11[400000] -> 13[600000] via P2P/IPC
MLVM: MLVM:10663:11327 [3] NCCL INFO Channel 01/0 : 3[400000] -> 5[600000] via P2P/IPC
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 00/0 : 0[100000] -> 8[100000] [receive] via NET/IB/0
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 00/0 : 8[100000] -> 0[100000] [receive] via NET/IB/0
MLVM: MLVM:10669:11326 [7] NCCL INFO Connected all rings
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Connected all rings
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Channel 00/0 : 13[600000] -> 14[700000] via P2P/IPC
MLVM: MLVM:10665:11325 [5] NCCL INFO Connected all rings
MLVM: MLVM:10665:11325 [5] NCCL INFO Channel 00/0 : 5[600000] -> 6[700000] via P2P/IPC
MLVM: MLVM:10666:11329 [6] NCCL INFO Connected all rings
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Connected all rings
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Connected all rings
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 01/0 : 0[100000] -> 8[100000] [receive] via NET/IB/0
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 01/0 : 8[100000] -> 0[100000] [receive] via NET/IB/0
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Channel 01/0 : 13[600000] -> 14[700000] via P2P/IPC
MLVM: MLVM:10665:11325 [5] NCCL INFO Channel 01/0 : 5[600000] -> 6[700000] via P2P/IPC
MLVM: MLVM:10666:11329 [6] NCCL INFO Channel 00/0 : 6[700000] -> 7[800000] via P2P/IPC
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Channel 00/0 : 14[700000] -> 15[800000] via P2P/IPC
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 00/0 : 8[100000] -> 0[100000] [send] via NET/IB/0
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 00/0 : 0[100000] -> 8[100000] [send] via NET/IB/0
MLVM: MLVM:10666:11329 [6] NCCL INFO Channel 01/0 : 6[700000] -> 7[800000] via P2P/IPC
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Channel 01/0 : 14[700000] -> 15[800000] via P2P/IPC
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 01/0 : 0[100000] -> 8[100000] [send] via NET/IB/0
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 01/0 : 8[100000] -> 0[100000] [send] via NET/IB/0
MLVM: MLVM:10669:11326 [7] NCCL INFO Channel 00/0 : 7[800000] -> 4[500000] via P2P/IPC
MLVM: MLVM:10666:11329 [6] NCCL INFO Connected all trees
MLVM: MLVM:10666:11329 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM: MLVM:10666:11329 [6] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Channel 00/0 : 15[800000] -> 12[500000] via P2P/IPC
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Connected all trees
MLVM2: MLVM2:8630:9228 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM2: MLVM2:8630:9228 [6] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM: MLVM:10669:11326 [7] NCCL INFO Channel 01/0 : 7[800000] -> 4[500000] via P2P/IPC
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Channel 01/0 : 15[800000] -> 12[500000] via P2P/IPC
MLVM: MLVM:10664:11324 [4] NCCL INFO Channel 00/0 : 4[500000] -> 2[300000] via P2P/IPC
MLVM: MLVM:10669:11326 [7] NCCL INFO Connected all trees
MLVM: MLVM:10669:11326 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM: MLVM:10669:11326 [7] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM: MLVM:10663:11327 [3] NCCL INFO Connected all trees
MLVM: MLVM:10663:11327 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM: MLVM:10663:11327 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM: MLVM:10663:11327 [3] NCCL INFO Channel 00/1 : 3[400000] -> 4[500000] via P2P/indirect/2[300000]
MLVM: MLVM:10665:11325 [5] NCCL INFO Connected all trees
MLVM: MLVM:10665:11325 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM: MLVM:10665:11325 [5] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Connected all trees
MLVM2: MLVM2:8633:9225 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM2: MLVM2:8633:9225 [7] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Connected all trees
MLVM2: MLVM2:8626:9230 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM2: MLVM2:8626:9230 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Channel 00/1 : 11[400000] -> 12[500000] via P2P/indirect/10[300000]
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Connected all trees
MLVM2: MLVM2:8628:9223 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM2: MLVM2:8628:9223 [5] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Channel 00/0 : 12[500000] -> 10[300000] via P2P/IPC
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Connected all trees
MLVM2: MLVM2:8623:9224 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM2: MLVM2:8623:9224 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 00/1 : 8[100000] -> 12[500000] via P2P/indirect/10[300000]
MLVM: MLVM:10660:11323 [0] NCCL INFO Connected all trees
MLVM: MLVM:10660:11323 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM: MLVM:10660:11323 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 00/1 : 0[100000] -> 4[500000] via P2P/indirect/2[300000]
MLVM: MLVM:10664:11324 [4] NCCL INFO Channel 01/0 : 4[500000] -> 2[300000] via P2P/IPC
MLVM: MLVM:10663:11327 [3] NCCL INFO Channel 01/1 : 3[400000] -> 4[500000] via P2P/indirect/2[300000]
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Channel 01/0 : 12[500000] -> 10[300000] via P2P/IPC
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Channel 01/1 : 11[400000] -> 12[500000] via P2P/indirect/10[300000]
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 01/1 : 0[100000] -> 4[500000] via P2P/indirect/2[300000]
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 01/1 : 8[100000] -> 12[500000] via P2P/indirect/10[300000]
MLVM: MLVM:10662:11330 [2] NCCL INFO Channel 00/0 : 2[300000] -> 1[200000] via P2P/IPC
MLVM: MLVM:10664:11324 [4] NCCL INFO Connected all trees
MLVM: MLVM:10664:11324 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM: MLVM:10664:11324 [4] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Channel 00/0 : 10[300000] -> 9[200000] via P2P/IPC
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Connected all trees
MLVM2: MLVM2:8627:9227 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM2: MLVM2:8627:9227 [4] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM: MLVM:10662:11330 [2] NCCL INFO Channel 01/0 : 2[300000] -> 1[200000] via P2P/IPC
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Channel 01/0 : 10[300000] -> 9[200000] via P2P/IPC
MLVM: MLVM:10661:11328 [1] NCCL INFO Connected all trees
MLVM: MLVM:10661:11328 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM: MLVM:10661:11328 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM: MLVM:10661:11328 [1] NCCL INFO Channel 00/1 : 1[200000] -> 4[500000] via P2P/indirect/2[300000]
MLVM: MLVM:10662:11330 [2] NCCL INFO Connected all trees
MLVM: MLVM:10662:11330 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM: MLVM:10662:11330 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM: MLVM:10662:11330 [2] NCCL INFO Channel 00/1 : 2[300000] -> 5[600000] via P2P/indirect/3[400000]
MLVM: MLVM:10663:11327 [3] NCCL INFO Channel 00/1 : 3[400000] -> 6[700000] via P2P/indirect/5[600000]
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Channel 00/1 : 11[400000] -> 14[700000] via P2P/indirect/13[600000]
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Connected all trees
MLVM2: MLVM2:8624:9226 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
MLVM2: MLVM2:8624:9226 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Channel 00/1 : 9[200000] -> 12[500000] via P2P/indirect/10[300000]
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Channel 00/1 : 10[300000] -> 13[600000] via P2P/indirect/11[400000]
MLVM: MLVM:10661:11328 [1] NCCL INFO Channel 01/1 : 1[200000] -> 4[500000] via P2P/indirect/2[300000]
MLVM: MLVM:10662:11330 [2] NCCL INFO Channel 01/1 : 2[300000] -> 5[600000] via P2P/indirect/3[400000]
MLVM: MLVM:10663:11327 [3] NCCL INFO Channel 01/1 : 3[400000] -> 6[700000] via P2P/indirect/5[600000]
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Channel 01/1 : 9[200000] -> 12[500000] via P2P/indirect/10[300000]
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Channel 01/1 : 11[400000] -> 14[700000] via P2P/indirect/13[600000]
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Channel 01/1 : 10[300000] -> 13[600000] via P2P/indirect/11[400000]
MLVM: MLVM:10662:11330 [2] NCCL INFO Channel 00/1 : 2[300000] -> 6[700000] via P2P/indirect/0[100000]
MLVM: MLVM:10661:11328 [1] NCCL INFO Channel 00/1 : 1[200000] -> 5[600000] via P2P/indirect/3[400000]
MLVM: MLVM:10663:11327 [3] NCCL INFO Channel 00/1 : 3[400000] -> 7[800000] via P2P/indirect/5[600000]
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Channel 00/1 : 10[300000] -> 14[700000] via P2P/indirect/8[100000]
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Channel 00/1 : 11[400000] -> 15[800000] via P2P/indirect/13[600000]
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Channel 00/1 : 9[200000] -> 13[600000] via P2P/indirect/11[400000]
MLVM: MLVM:10662:11330 [2] NCCL INFO Channel 01/1 : 2[300000] -> 6[700000] via P2P/indirect/0[100000]
MLVM: MLVM:10661:11328 [1] NCCL INFO Channel 01/1 : 1[200000] -> 5[600000] via P2P/indirect/3[400000]
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Channel 01/1 : 10[300000] -> 14[700000] via P2P/indirect/8[100000]
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Channel 01/1 : 9[200000] -> 13[600000] via P2P/indirect/11[400000]
MLVM: MLVM:10663:11327 [3] NCCL INFO Channel 01/1 : 3[400000] -> 7[800000] via P2P/indirect/5[600000]
MLVM2: MLVM2:8626:9230 [3] NCCL INFO Channel 01/1 : 11[400000] -> 15[800000] via P2P/indirect/13[600000]
MLVM: MLVM:10661:11328 [1] NCCL INFO Channel 00/1 : 1[200000] -> 6[700000] via P2P/indirect/7[800000]
MLVM: MLVM:10662:11330 [2] NCCL INFO Channel 00/1 : 2[300000] -> 7[800000] via P2P/indirect/4[500000]
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Channel 00/1 : 9[200000] -> 14[700000] via P2P/indirect/15[800000]
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Channel 00/1 : 10[300000] -> 15[800000] via P2P/indirect/12[500000]
MLVM: MLVM:10661:11328 [1] NCCL INFO Channel 01/1 : 1[200000] -> 6[700000] via P2P/indirect/7[800000]
MLVM: MLVM:10662:11330 [2] NCCL INFO Channel 01/1 : 2[300000] -> 7[800000] via P2P/indirect/4[500000]
MLVM: MLVM:10664:11324 [4] NCCL INFO Channel 00/1 : 4[500000] -> 0[100000] via P2P/indirect/2[300000]
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 00/1 : 0[100000] -> 5[600000] via P2P/indirect/3[400000]
MLVM2: MLVM2:8625:9229 [2] NCCL INFO Channel 01/1 : 10[300000] -> 15[800000] via P2P/indirect/12[500000]
MLVM2: MLVM2:8624:9226 [1] NCCL INFO Channel 01/1 : 9[200000] -> 14[700000] via P2P/indirect/15[800000]
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 00/1 : 8[100000] -> 13[600000] via P2P/indirect/11[400000]
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Channel 00/1 : 12[500000] -> 8[100000] via P2P/indirect/10[300000]
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 01/1 : 0[100000] -> 5[600000] via P2P/indirect/3[400000]
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 01/1 : 8[100000] -> 13[600000] via P2P/indirect/11[400000]
MLVM: MLVM:10664:11324 [4] NCCL INFO Channel 01/1 : 4[500000] -> 0[100000] via P2P/indirect/2[300000]
MLVM: MLVM:10666:11329 [6] NCCL INFO Channel 00/1 : 6[700000] -> 1[200000] via P2P/indirect/0[100000]
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Channel 00/1 : 14[700000] -> 9[200000] via P2P/indirect/8[100000]
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Channel 01/1 : 12[500000] -> 8[100000] via P2P/indirect/10[300000]
MLVM: MLVM:10665:11325 [5] NCCL INFO Channel 00/1 : 5[600000] -> 0[100000] via P2P/indirect/3[400000]
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 00/1 : 0[100000] -> 7[800000] via P2P/indirect/6[700000]
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Channel 00/1 : 13[600000] -> 8[100000] via P2P/indirect/11[400000]
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 00/1 : 8[100000] -> 15[800000] via P2P/indirect/14[700000]
MLVM: MLVM:10665:11325 [5] NCCL INFO Channel 01/1 : 5[600000] -> 0[100000] via P2P/indirect/3[400000]
MLVM: MLVM:10660:11323 [0] NCCL INFO Channel 01/1 : 0[100000] -> 7[800000] via P2P/indirect/6[700000]
MLVM: MLVM:10666:11329 [6] NCCL INFO Channel 01/1 : 6[700000] -> 1[200000] via P2P/indirect/0[100000]
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Channel 01/1 : 13[600000] -> 8[100000] via P2P/indirect/11[400000]
MLVM2: MLVM2:8623:9224 [0] NCCL INFO Channel 01/1 : 8[100000] -> 15[800000] via P2P/indirect/14[700000]
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Channel 01/1 : 14[700000] -> 9[200000] via P2P/indirect/8[100000]
MLVM: MLVM:10669:11326 [7] NCCL INFO Channel 00/1 : 7[800000] -> 0[100000] via P2P/indirect/1[200000]
MLVM: MLVM:10666:11329 [6] NCCL INFO Channel 00/1 : 6[700000] -> 2[300000] via P2P/indirect/4[500000]
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Channel 00/1 : 15[800000] -> 8[100000] via P2P/indirect/9[200000]
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Channel 00/1 : 14[700000] -> 10[300000] via P2P/indirect/12[500000]
MLVM: MLVM:10666:11329 [6] NCCL INFO Channel 01/1 : 6[700000] -> 2[300000] via P2P/indirect/4[500000]
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Channel 01/1 : 14[700000] -> 10[300000] via P2P/indirect/12[500000]
MLVM: MLVM:10669:11326 [7] NCCL INFO Channel 01/1 : 7[800000] -> 0[100000] via P2P/indirect/1[200000]
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Channel 01/1 : 15[800000] -> 8[100000] via P2P/indirect/9[200000]
MLVM: MLVM:10669:11326 [7] NCCL INFO Channel 00/1 : 7[800000] -> 2[300000] via P2P/indirect/4[500000]
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Channel 00/1 : 15[800000] -> 10[300000] via P2P/indirect/12[500000]
MLVM: MLVM:10669:11326 [7] NCCL INFO Channel 01/1 : 7[800000] -> 2[300000] via P2P/indirect/4[500000]
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Channel 01/1 : 15[800000] -> 10[300000] via P2P/indirect/12[500000]
MLVM: MLVM:10669:11326 [7] NCCL INFO Channel 00/1 : 7[800000] -> 3[400000] via P2P/indirect/5[600000]
MLVM: MLVM:10665:11325 [5] NCCL INFO Channel 00/1 : 5[600000] -> 1[200000] via P2P/indirect/7[800000]
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Channel 00/1 : 15[800000] -> 11[400000] via P2P/indirect/13[600000]
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Channel 00/1 : 13[600000] -> 9[200000] via P2P/indirect/15[800000]
MLVM: MLVM:10669:11326 [7] NCCL INFO Channel 01/1 : 7[800000] -> 3[400000] via P2P/indirect/5[600000]
MLVM: MLVM:10665:11325 [5] NCCL INFO Channel 01/1 : 5[600000] -> 1[200000] via P2P/indirect/7[800000]
MLVM2: MLVM2:8633:9225 [7] NCCL INFO Channel 01/1 : 15[800000] -> 11[400000] via P2P/indirect/13[600000]
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Channel 01/1 : 13[600000] -> 9[200000] via P2P/indirect/15[800000]
MLVM: MLVM:10666:11329 [6] NCCL INFO Channel 00/1 : 6[700000] -> 3[400000] via P2P/indirect/5[600000]
MLVM: MLVM:10665:11325 [5] NCCL INFO Channel 00/1 : 5[600000] -> 2[300000] via P2P/indirect/4[500000]
MLVM: MLVM:10664:11324 [4] NCCL INFO Channel 00/1 : 4[500000] -> 1[200000] via P2P/indirect/2[300000]
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Channel 00/1 : 13[600000] -> 10[300000] via P2P/indirect/12[500000]
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Channel 00/1 : 12[500000] -> 9[200000] via P2P/indirect/10[300000]
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Channel 00/1 : 14[700000] -> 11[400000] via P2P/indirect/13[600000]
MLVM: MLVM:10665:11325 [5] NCCL INFO Channel 01/1 : 5[600000] -> 2[300000] via P2P/indirect/4[500000]
MLVM: MLVM:10664:11324 [4] NCCL INFO Channel 01/1 : 4[500000] -> 1[200000] via P2P/indirect/2[300000]
MLVM: MLVM:10666:11329 [6] NCCL INFO Channel 01/1 : 6[700000] -> 3[400000] via P2P/indirect/5[600000]
MLVM2: MLVM2:8628:9223 [5] NCCL INFO Channel 01/1 : 13[600000] -> 10[300000] via P2P/indirect/12[500000]
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Channel 01/1 : 12[500000] -> 9[200000] via P2P/indirect/10[300000]
MLVM2: MLVM2:8630:9228 [6] NCCL INFO Channel 01/1 : 14[700000] -> 11[400000] via P2P/indirect/13[600000]
MLVM: MLVM:10664:11324 [4] NCCL INFO Channel 00/1 : 4[500000] -> 3[400000] via P2P/indirect/5[600000]
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Channel 00/1 : 12[500000] -> 11[400000] via P2P/indirect/13[600000]
MLVM: MLVM:10664:11324 [4] NCCL INFO Channel 01/1 : 4[500000] -> 3[400000] via P2P/indirect/5[600000]
MLVM2: MLVM2:8627:9227 [4] NCCL INFO Channel 01/1 : 12[500000] -> 11[400000] via P2P/indirect/13[600000]
MLVM: MLVM:10664:11324 [4] NCCL INFO comm 0xa8d72e0 rank 4 nranks 16 cudaDev 4 busId 500000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM2: MLVM2:8627:9227 [4] NCCL INFO comm 0xa01a5b0 rank 12 nranks 16 cudaDev 4 busId 500000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM2: MLVM2:8628:9223 [5] NCCL INFO comm 0xa986d20 rank 13 nranks 16 cudaDev 5 busId 600000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM2: MLVM2:8630:9228 [6] NCCL INFO comm 0x99a5430 rank 14 nranks 16 cudaDev 6 busId 700000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM: MLVM:10662:11330 [2] NCCL INFO comm 0xa84f3a0 rank 2 nranks 16 cudaDev 2 busId 300000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM: MLVM:10666:11329 [6] NCCL INFO comm 0xa59a4e0 rank 6 nranks 16 cudaDev 6 busId 700000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM2: MLVM2:8625:9229 [2] NCCL INFO comm 0x9ef05b0 rank 10 nranks 16 cudaDev 2 busId 300000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM2: MLVM2:8623:9224 [0] NCCL INFO comm 0xa2c1120 rank 8 nranks 16 cudaDev 0 busId 100000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM: MLVM:10665:11325 [5] NCCL INFO comm 0x938dec0 rank 5 nranks 16 cudaDev 5 busId 600000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM: MLVM:10661:11328 [1] NCCL INFO comm 0x9dd94f0 rank 1 nranks 16 cudaDev 1 busId 200000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM: MLVM:10663:11327 [3] NCCL INFO comm 0xa20bc10 rank 3 nranks 16 cudaDev 3 busId 400000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM: MLVM:10669:11326 [7] NCCL INFO comm 0x99a8580 rank 7 nranks 16 cudaDev 7 busId 800000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM: MLVM:10660:11323 [0] NCCL INFO comm 0x96473c0 rank 0 nranks 16 cudaDev 0 busId 100000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM2: MLVM2:8626:9230 [3] NCCL INFO comm 0x95c4ba0 rank 11 nranks 16 cudaDev 3 busId 400000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM2: MLVM2:8633:9225 [7] NCCL INFO comm 0xa3069a0 rank 15 nranks 16 cudaDev 7 busId 800000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM2: MLVM2:8624:9226 [1] NCCL INFO comm 0xa6812d0 rank 9 nranks 16 cudaDev 1 busId 200000 commId 0xb98ea4899d9e585e - Init COMPLETE
MLVM2: 
MLVM2: MLVM2:8623:9248 [0] transport/net_ib.cc:1296 NCCL WARN NET/IB : Got completion from peer 10.1.0.4<55456> with error 12, opcode 0, len 0, vendor err 129 (Recv)
MLVM2: MLVM2:8623:9248 [0] NCCL INFO transport/net.cc:1134 -> 6
MLVM2: MLVM2:8623:9248 [0] NCCL INFO proxy.cc:679 -> 6
MLVM2: MLVM2:8623:9248 [0] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
MLVM: 
MLVM: MLVM:10660:11348 [0] transport/net_ib.cc:1296 NCCL WARN NET/IB : Got completion from peer 10.1.0.5<51900> with error 12, opcode 0, len 0, vendor err 129 (Recv)
MLVM: MLVM:10660:11348 [0] NCCL INFO transport/net.cc:1134 -> 6
MLVM: MLVM:10660:11348 [0] NCCL INFO proxy.cc:679 -> 6
MLVM: MLVM:10660:11348 [0] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
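
For anyone reading the two WARN lines above: the `error 12` field is a verbs work-completion status, and in the libibverbs `ibv_wc_status` enum the value 12 is `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded), which usually points at the fabric or the peer rather than at PyTorch. Below is a minimal sketch for pulling these warnings out of a log and naming the status. The function name `decode_ib_warnings` is hypothetical, and the table only covers a few enum values.

```python
import re

# Partial copy of the ibv_wc_status enum from libibverbs; only a few
# values are listed here, everything else is reported as "unknown".
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",      # transport retry counter exceeded
    13: "IBV_WC_RNR_RETRY_EXC_ERR",  # receiver-not-ready retries exceeded
}

# Matches the NET/IB completion warning format seen in the log above.
WARN_RE = re.compile(
    r"NCCL WARN NET/IB : Got completion from peer (?P<peer>\S+?)<(?P<port>\d+)> "
    r"with error (?P<status>\d+), opcode (?P<opcode>\d+), len (?P<len>\d+), "
    r"vendor err (?P<vendor>\d+)"
)


def decode_ib_warnings(log_text):
    """Return one dict per NET/IB completion warning found in log_text."""
    found = []
    for match in WARN_RE.finditer(log_text):
        status = int(match["status"])
        found.append(
            {
                "peer": match["peer"],
                "status": status,
                "name": IBV_WC_STATUS.get(status, "unknown"),
                "vendor_err": int(match["vendor"]),
            }
        )
    return found
```

Feeding the full NCCL_DEBUG=INFO output through this shows which peers report the failure, which is a quick way to see whether both nodes time out on each other.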

My instinct is that the problem may be with my network interfaces rather than PyTorch.

Any suggestions?

I would run all of the nccl-tests from their repository to take PyTorch out of the workload and, hopefully, isolate the issue more easily.
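
A rough sketch of what such a run could look like, written as a small Python helper that only builds the command line. It assumes Open MPI's `mpirun` (for `-H` and `-x`), an nccl-tests checkout built with MPI support, and the binary path `./build/all_reduce_perf`; the host names are taken from the log prefixes above and may not match your actual hostnames.

```python
import shlex


def build_nccl_tests_cmd(hosts, gpus_per_host, binary="./build/all_reduce_perf"):
    """Build an mpirun command that starts one nccl-tests rank per GPU."""
    total_ranks = len(hosts) * gpus_per_host
    # Open MPI host list syntax: host:slots,host:slots
    host_arg = ",".join(f"{host}:{gpus_per_host}" for host in hosts)
    return [
        "mpirun", "-np", str(total_ranks), "-H", host_arg,
        "-x", "NCCL_DEBUG=INFO",  # forward the debug setting to every rank
        binary,
        "-b", "8",     # smallest message size in bytes
        "-e", "128M",  # largest message size
        "-f", "2",     # multiply the size by 2 at each step
        "-g", "1",     # one GPU per process
    ]


cmd = build_nccl_tests_cmd(["MLVM", "MLVM2"], gpus_per_host=8)
print(shlex.join(cmd))
```

If the same command works with a single host and fails with both, the problem is below PyTorch.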

Thanks @ptrblck. Following a suggestion from someone at NCCL, I decided to go a level lower and try MPI examples.

Tasks across nodes fail with socket connection errors, while single-node tasks work fine. It is clear the problem is the communication between the VMs and not the software stack (neither PyTorch nor NCCL). For this reason, I will close this issue.
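
For readers in the same spot, a plain TCP reachability check between the VMs is a cheap first test before involving MPI or NCCL. This is a minimal sketch using only the standard library; the helper name `can_connect` is made up, and it only checks TCP sockets, so it says nothing about InfiniBand verbs traffic.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

One way to use it: start a listener on one VM on a port of your choice, call `can_connect` from the other VM with that VM's address and port, then repeat in the opposite direction. A failure in either direction suggests firewall or virtual network rules rather than the training stack.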

What did the communication issue between your VMs turn out to be, @Jonathan1?

@jasonchitla I actually gave up on trying to make this work for the VMs and used a different cluster instead.

Are you in a similar situation?

My only suggestion would be placing the VMs within the same virtual network; this was futile for me, and I won’t be surprised if it is for you as well.

For your reference, I solved this issue with the environment below:
conda create -n env_name python=3.9 numpy pandas
conda activate env_name
conda install nvidia/label/cuda-11.7.0::cuda-toolkit
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.0.0+cu117.html

I think this is caused by the version conflict between cudatoolkit (NCCL) and torch.
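
To check whether versions line up, it can help to print what the installed torch build reports. A minimal sketch, assuming torch is importable in the environment; the function name `report_versions` is hypothetical, and it returns None when torch is missing.

```python
def report_versions():
    """Collect the torch, CUDA, and NCCL versions reported by the torch build."""
    try:
        import torch
    except ImportError:
        return None  # torch is not installed in this environment
    info = {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,  # None on CPU-only builds
        "nccl": None,
    }
    if torch.cuda.is_available():
        # NCCL version torch was built against, e.g. (2, 18, 1)
        info["nccl"] = torch.cuda.nccl.version()
    return info


print(report_versions())
```

Running this on every node and comparing the output is a quick way to spot an environment that differs from the others.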

I landed here after searching for this issue. Changing NCCL_SOCKET_IFNAME is what finally helped me, taken from here:
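
A minimal sketch of that kind of change (this is not the linked snippet): pick an interface name and export NCCL_SOCKET_IFNAME before the process group is created. The helper names and the prefix order `("ib", "eth", "en")` are assumptions; check `ip addr` on your nodes for the real interface names.

```python
import os
import socket


def pick_interface(names, preferred_prefixes=("ib", "eth", "en")):
    """Return the first interface whose name starts with a preferred prefix."""
    for prefix in preferred_prefixes:
        for name in names:
            if name.startswith(prefix):
                return name
    return None


def configure_nccl_ifname():
    """Set NCCL_SOCKET_IFNAME from the local interface list if it is unset."""
    # socket.if_nameindex() is not available on every platform.
    if not hasattr(socket, "if_nameindex"):
        return None
    names = [name for _, name in socket.if_nameindex()]
    chosen = pick_interface(names)
    if chosen is not None:
        # Must happen before NCCL creates its communicators.
        os.environ.setdefault("NCCL_SOCKET_IFNAME", chosen)
    return chosen
```

The variable has to be set on every node, and it has to name an interface through which the nodes can actually reach each other.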

I am facing this error. The jobs run on a single node, but I hit the error every time I use more than 2 GPUs; if I restrict the job to 2 GPUs, it works fine. The issue is not restricted to one node: I tried launching the job on a different node and still see the same error.

[2024-07-25T01:10:59.170447Z] b9f5df8d [rank=3] || 53ab7896e3a7:816:1707 [3] NCCL INFO Channel 20/0 : 3[3] -> 2[2] via P2P/IPC
[2024-07-25T01:10:59.170470Z] b9f5df8d [rank=3] || 53ab7896e3a7:816:1707 [3] NCCL INFO Channel 21/0 : 3[3] -> 2[2] via P2P/IPC
[2024-07-25T01:10:59.170492Z] b9f5df8d [rank=3] || 53ab7896e3a7:816:1707 [3] NCCL INFO Channel 22/0 : 3[3] -> 2[2] via P2P/IPC
[2024-07-25T01:10:59.170513Z] b9f5df8d [rank=3] || 53ab7896e3a7:816:1707 [3] NCCL INFO Channel 23/0 : 3[3] -> 2[2] via P2P/IPC
[2024-07-25T01:10:59.170535Z] b9f5df8d [rank=3] || 53ab7896e3a7:816:1707 [3] NCCL INFO Connected all trees
[2024-07-25T01:10:59.170558Z] b9f5df8d [rank=3] || 53ab7896e3a7:816:1707 [3] NCCL INFO NVLS comm 0x55cf72020af0 headRank 3 nHeads 4 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 805306368
[2024-07-25T01:10:59.170579Z] b9f5df8d [rank=3] ||
[2024-07-25T01:10:59.170602Z] b9f5df8d [rank=3] || 53ab7896e3a7:816:1707 [3] transport/nvls.cc:165 NCCL WARN Cuda failure 'system not yet initialized'
[2024-07-25T01:10:59.170624Z] b9f5df8d [rank=3] || 53ab7896e3a7:816:1707 [3] NCCL INFO transport/nvls.cc:324 -> 1
[2024-07-25T01:10:59.170646Z] b9f5df8d [rank=3] || 53ab7896e3a7:816:1707 [3] NCCL INFO init.cc:1093 -> 1
[2024-07-25T01:10:59.170668Z] b9f5df8d [rank=3] || 53ab7896e3a7:816:1707 [3] NCCL INFO init.cc:1358 -> 1
[2024-07-25T01:10:59.170690Z] b9f5df8d [rank=3] || 53ab7896e3a7:816:1707 [3] NCCL INFO group.cc:65 -> 1 [Async thread]
[2024-07-25T01:10:59.170711Z] b9f5df8d [rank=3] || 53ab7896e3a7:816:816 [3] NCCL INFO group.cc:406 -> 1
[2024-07-25T01:10:59.170733Z] b9f5df8d [rank=3] || 53ab7896e3a7:816:816 [3] NCCL INFO group.cc:96 -> 1

[2024-07-25T01:10:59.198876Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
[2024-07-25T01:10:59.198881Z] b9f5df8d [rank=2] || return model_class.from_pretrained(
[2024-07-25T01:10:59.198895Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3462, in from_pretrained
[2024-07-25T01:10:59.198900Z] b9f5df8d [rank=2] || model = cls(config, *model_args, **model_kwargs)
[2024-07-25T01:10:59.198905Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 459, in wrapper
[2024-07-25T01:10:59.198910Z] b9f5df8d [rank=2] || f(module, *args, **kwargs)
[2024-07-25T01:10:59.198917Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1109, in __init__
[2024-07-25T01:10:59.198922Z] b9f5df8d [rank=2] || self.model = LlamaModel(config)
[2024-07-25T01:10:59.198928Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 459, in wrapper
[2024-07-25T01:10:59.198947Z] b9f5df8d [rank=2] || f(module, *args, **kwargs)
[2024-07-25T01:10:59.198957Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 954, in __init__
[2024-07-25T01:10:59.198964Z] b9f5df8d [rank=2] || self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
[2024-07-25T01:10:59.198983Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 466, in wrapper
[2024-07-25T01:10:59.198992Z] b9f5df8d [rank=2] || self._post_init_method(module)
[2024-07-25T01:10:59.199262Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1000, in _post_init_method
[2024-07-25T01:10:59.199275Z] b9f5df8d [rank=2] || self._zero_init_param(param)
[2024-07-25T01:10:59.199282Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 956, in _zero_init_param
[2024-07-25T01:10:59.199290Z] b9f5df8d [rank=2] || dist.broadcast(param, 0, self.get_dp_process_group())
[2024-07-25T01:10:59.199295Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[2024-07-25T01:10:59.199301Z] b9f5df8d [rank=2] || return func(*args, **kwargs)
[2024-07-25T01:10:59.199306Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/comm.py", line 224, in broadcast
[2024-07-25T01:10:59.199312Z] b9f5df8d [rank=2] || return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[2024-07-25T01:10:59.199317Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/deepspeed/comm/torch.py", line 196, in broadcast
[2024-07-25T01:10:59.199322Z] b9f5df8d [rank=2] || return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
[2024-07-25T01:10:59.199327Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
[2024-07-25T01:10:59.199332Z] b9f5df8d [rank=2] || return func(*args, **kwargs)
[2024-07-25T01:10:59.199339Z] b9f5df8d [rank=2] || File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1900, in broadcast
[2024-07-25T01:10:59.199345Z] b9f5df8d [rank=2] || work = default_pg.broadcast([tensor], opts)
[2024-07-25T01:10:59.199351Z] b9f5df8d [rank=2] || torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.5
[2024-07-25T01:10:59.199356Z] b9f5df8d [rank=2] || ncclUnhandledCudaError: Call to CUDA function failed.
[2024-07-25T01:10:59.199361Z] b9f5df8d [rank=2] || Last error:
[2024-07-25T01:10:59.199366Z] b9f5df8d [rank=2] || Cuda failure 'system not yet initialized'

Could you check if your system runs NVLink/NVSwitch fabric manager? If not, you might want to install it.
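
One way to check this from Python, as a rough sketch: ask systemd whether the service is active. It assumes a systemd-based host and the service name `nvidia-fabricmanager`, which is what NVIDIA's packages normally install; inside a container the check has to run on the host instead. The `run` parameter is only there so the function can be exercised without systemd.

```python
import subprocess


def fabric_manager_active(run=subprocess.run):
    """Return True if systemd reports the nvidia-fabricmanager service as active."""
    try:
        result = run(
            ["systemctl", "is-active", "nvidia-fabricmanager"],
            capture_output=True,
            text=True,
            check=False,  # a non-zero exit code just means "not active"
        )
    except FileNotFoundError:
        # systemctl is not present, e.g. inside a minimal container
        return False
    return result.stdout.strip() == "active"
```

Since the failure above comes from `transport/nvls.cc`, setting `NCCL_NVLS_ENABLE=0` may also be worth trying as a temporary workaround while the fabric manager is being sorted out.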