Hello! When running a DeepSpeed training job, I get this error: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331
The job works fine on a single node but gives this error in a multi-node setup.
Any suggestions are appreciated.
Setup
2x NDv2 VMs, 8x V100 per VM.
ibX InfiniBand IP interface on both nodes. As the logs below show, NCCL detects this interface correctly, so NCCL_SOCKET_IFNAME is not set.
Software (identical on both nodes)
torch.__version__ → 2.1.0+cu121
torch.cuda.nccl.version() → (2, 18, 1)
libnccl → Version: 2.19.3-1+cuda12.3
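Note that the two NCCL versions listed above differ. The log below reports "NCCL version 2.18.1+cuda12.1", so the copy bundled with PyTorch appears to be the one in use. A minimal sketch of how the two values can be compared; the helper name parse_nccl_version is hypothetical, and the version strings are the ones listed above:

```python
import re


def parse_nccl_version(text):
    """Extract (major, minor, patch) from a string such as '2.19.3-1+cuda12.3'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.groups())


# Value reported by torch.cuda.nccl.version() on both nodes.
bundled = (2, 18, 1)
# Version string of the system libnccl package on both nodes.
system = parse_nccl_version("2.19.3-1+cuda12.3")

if bundled != system:
    print(f"NCCL mismatch: PyTorch bundles {bundled}, system libnccl is {system}")
```

Whether this mismatch is related to the error is not established here; it is only something worth ruling out.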
Env variables
NCCL_DEBUG=INFO
Log
MLVM: MLVM:12343:12343 [0] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12343:12343 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:12343:12343 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:12343:12343 [0] NCCL INFO cudaDriverVersion 12030
MLVM: NCCL version 2.18.1+cuda12.1
MLVM: MLVM:12353:12353 [7] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:12353:12353 [7] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12347:12347 [4] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:12353:12353 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:12353:12353 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:12347:12347 [4] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12347:12347 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:12347:12347 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:10793:10793 [1] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:10799:10799 [6] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:10793:10793 [1] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10799:10799 [6] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10799:10799 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:10799:10799 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:10793:10793 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:10793:10793 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:12350:12350 [6] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:12350:12350 [6] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>
MLVM2: MLVM2:10796:10796 [4] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:12350:12350 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:12350:12350 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:10796:10796 [4] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10796:10796 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:10796:10796 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:12346:12346 [3] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:10797:10797 [5] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:12346:12346 [3] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>
MLVM2: MLVM2:10797:10797 [5] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>
MLVM: MLVM:12348:12348 [5] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:12346:12346 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:12346:12346 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:10797:10797 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:10797:10797 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:12348:12348 [5] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12348:12348 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:12348:12348 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:12344:12344 [1] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:12344:12344 [1] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12344:12344 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:12344:12344 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:10802:10802 [7] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:10802:10802 [7] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10802:10802 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:10802:10802 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:10792:10792 [0] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:10792:10792 [0] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10792:10792 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:10792:10792 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:12345:12345 [2] NCCL INFO cudaDriverVersion 12030
MLVM: MLVM:12345:12345 [2] NCCL INFO Bootstrap : Using ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12345:12345 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM: MLVM:12345:12345 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:10795:10795 [3] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:10795:10795 [3] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10795:10795 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:10795:10795 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM2: MLVM2:10794:10794 [2] NCCL INFO cudaDriverVersion 12030
MLVM2: MLVM2:10794:10794 [2] NCCL INFO Bootstrap : Using ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10794:10794 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
MLVM2: MLVM2:10794:10794 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
MLVM: MLVM:12343:13006 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12343:13006 [0] NCCL INFO Using network IB
MLVM: MLVM:12348:13009 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12348:13009 [5] NCCL INFO Using network IB
MLVM: MLVM:12353:13008 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12353:13008 [7] NCCL INFO Using network IB
MLVM: MLVM:12347:13012 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12347:13012 [4] NCCL INFO Using network IB
MLVM: MLVM:12346:13010 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12346:13010 [3] NCCL INFO Using network IB
MLVM: MLVM:12344:13011 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12344:13011 [1] NCCL INFO Using network IB
MLVM2: MLVM2:10799:11392 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10799:11392 [6] NCCL INFO Using network IB
MLVM: MLVM:12350:13007 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12350:13007 [6] NCCL INFO Using network IB
MLVM2: MLVM2:10796:11396 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10796:11396 [4] NCCL INFO Using network IB
MLVM2: MLVM2:10802:11393 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10802:11393 [7] NCCL INFO Using network IB
MLVM2: MLVM2:10797:11395 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10797:11395 [5] NCCL INFO Using network IB
MLVM: MLVM:12345:13013 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s452603:172.16.1.94<0>
MLVM: MLVM:12345:13013 [2] NCCL INFO Using network IB
MLVM2: MLVM2:10795:11398 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10795:11398 [3] NCCL INFO Using network IB
MLVM2: MLVM2:10794:11399 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10794:11399 [2] NCCL INFO Using network IB
MLVM2: MLVM2:10792:11397 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10792:11397 [0] NCCL INFO Using network IB
MLVM2: MLVM2:10793:11394 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB ibP257s464513:172.16.1.56<0>
MLVM2: MLVM2:10793:11394 [1] NCCL INFO Using network IB
MLVM2: MLVM2:10796:11396 [4] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:10802:11393 [7] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:10797:11395 [5] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:10802:11393 [7] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:10796:11396 [4] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:10797:11395 [5] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:10802:11393 [7] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:10796:11396 [4] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:10797:11395 [5] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:10802:11393 [7] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:10796:11396 [4] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:10797:11395 [5] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:10802:11393 [7] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:10796:11396 [4] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:10797:11395 [5] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:10795:11398 [3] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:10795:11398 [3] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:10795:11398 [3] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:10793:11394 [1] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:10795:11398 [3] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:10793:11394 [1] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:10795:11398 [3] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:10793:11394 [1] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:10799:11392 [6] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:10793:11394 [1] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:10799:11392 [6] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:10793:11394 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:10799:11392 [6] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:10799:11392 [6] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:10799:11392 [6] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:10792:11397 [0] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:10792:11397 [0] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:10792:11397 [0] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:10792:11397 [0] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:10794:11399 [2] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:10792:11397 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:10794:11399 [2] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:10794:11399 [2] NCCL INFO bootstrap.cc:270 -> 2
MLVM2: MLVM2:10794:11399 [2] NCCL INFO init.cc:1303 -> 2
MLVM2: MLVM2:10794:11399 [2] NCCL INFO group.cc:64 -> 2 [Async thread]
MLVM2: MLVM2:10797:10797 [5] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:10795:10795 [3] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:10797:10797 [5] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:10795:10795 [3] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:10796:10796 [4] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:10802:10802 [7] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:10796:10796 [4] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:10802:10802 [7] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:10794:10794 [2] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:10793:10793 [1] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:10794:10794 [2] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:10793:10793 [1] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:10799:10799 [6] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:10799:10799 [6] NCCL INFO group.cc:106 -> 2
MLVM2: MLVM2:10792:10792 [0] NCCL INFO group.cc:422 -> 2
MLVM2: MLVM2:10792:10792 [0] NCCL INFO group.cc:106 -> 2
MLVM2: Traceback (most recent call last):
MLVM2: Traceback (most recent call last):
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2: Traceback (most recent call last):
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2: Traceback (most recent call last):
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2: Traceback (most recent call last):
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2: Traceback (most recent call last):
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2: pretrain(train_valid_test_datasets_provider,
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2: pretrain(train_valid_test_datasets_provider,
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2: initialize_megatron(extra_args_provider=extra_args_provider,
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2: initialize_megatron(extra_args_provider=extra_args_provider,
MLVM2: pretrain(train_valid_test_datasets_provider, File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2:
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2: _compile_dependencies()
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2: _compile_dependencies()
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2: initialize_megatron(extra_args_provider=extra_args_provider,pretrain(train_valid_test_datasets_provider,torch.distributed.barrier()
MLVM2:
MLVM2:
MLVM2: pretrain(train_valid_test_datasets_provider, File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2: torch.distributed.barrier()
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2:
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2: return func(*args, **kwargs)_compile_dependencies()return func(*args, **kwargs)
MLVM2:
MLVM2:
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2: initialize_megatron(extra_args_provider=extra_args_provider,
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2: initialize_megatron(extra_args_provider=extra_args_provider,
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2: _compile_dependencies()torch.distributed.barrier()
MLVM2:
MLVM2: _compile_dependencies()
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2: pretrain(train_valid_test_datasets_provider,
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/training.py", line 124, in pretrain
MLVM2: return func(*args, **kwargs)
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2: torch.distributed.barrier()
MLVM2: torch.distributed.barrier()
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2: initialize_megatron(extra_args_provider=extra_args_provider,
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 102, in initialize_megatron
MLVM2: return func(*args, **kwargs)
MLVM2: return func(*args, **kwargs)
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2: _compile_dependencies()
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/megatron/initialize.py", line 160, in _compile_dependencies
MLVM2: Traceback (most recent call last):
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/examples_deepspeed/MoE/../../pretrain_gpt.py", line 362, in <module>
MLVM2: torch.distributed.barrier()
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
MLVM2: return func(*args, **kwargs)
MLVM2: File "/home/azureuser/Megatron-DeepSpeed/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
MLVM2: work = default_pg.barrier(opts=opts)
MLVM2: work = default_pg.barrier(opts=opts)
MLVM2: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
MLVM2: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
MLVM2: Last error:
MLVM2:
MLVM2: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
MLVM2: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
MLVM2: Last error:
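Because the output of all 16 ranks is interleaved, it can help to extract the first failing call site per rank. The following is a minimal sketch that assumes the line layout seen in the log above ("HOST: HOST:PID:TID [LOCAL_RANK] NCCL INFO file.cc:LINE -> CODE"); the function name first_failures is hypothetical.

```python
import re

# Matches lines such as:
#   "MLVM2: MLVM2:10796:11396 [4] NCCL INFO misc/socket.cc:564 -> 2"
ERROR_LINE = re.compile(
    r"^(?P<host>[^:\s]+): [^:\s]+:(?P<pid>\d+):\d+ \[(?P<rank>\d+)\] "
    r"NCCL INFO (?P<site>\S+\.cc:\d+) -> (?P<code>\d+)"
)


def first_failures(log_text):
    """Map (host, local_rank) to the first call site that returned an error."""
    first = {}
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            key = (match["host"], int(match["rank"]))
            # Keep only the earliest error line seen for each rank.
            first.setdefault(key, match["site"])
    return first


sample = """\
MLVM2: MLVM2:10796:11396 [4] NCCL INFO Using network IB
MLVM2: MLVM2:10796:11396 [4] NCCL INFO misc/socket.cc:564 -> 2
MLVM2: MLVM2:10796:11396 [4] NCCL INFO misc/socket.cc:615 -> 2
MLVM2: MLVM2:10796:11396 [4] NCCL INFO bootstrap.cc:270 -> 2
"""
for (host, rank), site in sorted(first_failures(sample).items()):
    print(f"{host} rank {rank}: first error at {site}")
```

In the log above, every error line comes from MLVM2, and each of its 8 ranks first fails at misc/socket.cc:564 and then propagates through bootstrap.cc:270 and init.cc:1303. MLVM prints no error lines. The return code 2 is consistent with the ncclSystemError reported in the traceback.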