ncclSystemError when using all_to_all() API

I compiled PyTorch v1.9.0 from source with CUDA 11.0 and NCCL 2.10.3.

NCCL 2.10.3 Upgrade

apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub && add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /" && apt update && apt install -y --allow-change-held-packages libnccl2=2.10.3-1+cuda11.0 libnccl-dev=2.10.3-1+cuda11.0

Compile PyTorch v1.9.0 from source

git clone https://github.com/pytorch/pytorch && cd pytorch && git checkout v1.9.0 && git submodule sync && git submodule update --init --recursive && sudo USE_SYSTEM_NCCL=1 TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0" python3 setup.py install
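
To confirm the build actually picked up CUDA 11.0 and the system NCCL, a quick sanity check (a sketch for verification only, not part of the original build steps) is:

# Post-build check: which CUDA and NCCL versions the torch binary reports.
import torch

print(torch.__version__)           # e.g. 1.9.0a0+git...
print(torch.version.cuda)          # expect 11.0
print(torch.cuda.nccl.version())   # NCCL version code PyTorch was built/linked against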

Compile AWS-OFI-NCCL to support AWS EFA (fast cross-machine communication on AWS)

git clone https://github.com/aws/aws-ofi-nccl.git $HOME/aws-ofi-nccl \
	     && cd $HOME/aws-ofi-nccl \
	     && git checkout aws  \
	     && ./autogen.sh \
	     && ./configure --prefix=$HOME/aws-ofi-nccl/install \
	        --with-libfabric=/opt/amazon/efa/ \
	        --with-cuda=/usr/local/cuda \
	        --with-nccl=/tmp/pytorch/build/nccl \
	        --with-mpi=/opt/amazon/openmpi/ \
	     && make -j$(nproc) && make install

Got the following error when using the all_to_all() API:

10.4.22.101:   File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 310, in forward
10.4.22.101:     expert_inputs = self.Shuffle(torch.cat(route_inputs))
10.4.22.101:   File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 508, in Shuffle
10.4.22.101:     return _Shuffle.apply(x)
10.4.22.101:   File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 398, in forward
10.4.22.101:     return _shuffle(input_)
10.4.22.101:   File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 202, in _shuffle
10.4.22.101:     output_tensor_list, input_tensor_list, mpu.get_data_parallel_group()
10.4.22.101:   File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 2478, in all_to_all
10.4.22.101:     work = group.alltoall(output_tensor_list, input_tensor_list, opts)
10.4.22.101: RuntimeError: NCCL error in: /tmp/pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:38, unhandled system error, NCCL version 21.0.3
10.4.22.101: ncclSystemError: System call (socket, malloc, munmap, etc) failed.
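
For context, the failing call boils down to a plain torch.distributed.all_to_all over the data-parallel group. A minimal standalone sketch of the same pattern (tensor sizes and the LOCAL_RANK convention are illustrative assumptions, not taken from the actual model code) is:

# Minimal all_to_all sketch: one process per GPU, launched e.g. with
# python -m torch.distributed.launch --use_env so that LOCAL_RANK is exported.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # RANK/WORLD_SIZE/MASTER_ADDR come from the launcher
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

world_size = dist.get_world_size()
# one input chunk per destination rank, one output buffer per source rank
input_tensor_list = [torch.full((1024,), float(dist.get_rank()), device="cuda") for _ in range(world_size)]
output_tensor_list = [torch.empty(1024, device="cuda") for _ in range(world_size)]

dist.all_to_all(output_tensor_list, input_tensor_list)   # the same API call that hits ncclSystemError in our setup
torch.cuda.synchronize()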

The environment is as follows:

------------nvidia-smi------------
Mon Aug  9 22:18:40 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:16.0 Off |                    0 |
| N/A   43C    P0    43W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   42C    P0    43W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   42C    P0    45W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:19.0 Off |                    0 |
| N/A   44C    P0    45W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:00:1A.0 Off |                    0 |
| N/A   43C    P0    43W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   43C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   42C    P0    43W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   45C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
------------python3 --version------------
Python 3.6.9
------------nvcc --version------------
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
------------python3 -c import torch; print(torch.__version__)------------
1.9.0a0+gitd69c22d
------------python3 -c import torch;print(torch.cuda.nccl.version())------------
3003
------------collect environment------------
--2021-08-09 22:18:41--  https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16993 (17K) [text/plain]
Saving to: 'collect_env.py'

     0K .......... ......                                     100% 64.9M=0s

2021-08-09 22:18:41 (64.9 MB/s) - 'collect_env.py' saved [16993/16993]

Collecting environment information...
PyTorch version: 1.9.0a0+gitd69c22d
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.25

Python version: 3.6.9 (default, Jan 26 2021, 15:33:00)  [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-4.14.200-155.322.amzn2.x86_64-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.0.221
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB
GPU 5: Tesla V100-SXM2-32GB
GPU 6: Tesla V100-SXM2-32GB
GPU 7: Tesla V100-SXM2-32GB

Nvidia driver version: 450.80.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] pytorch-ignite==0.4.6
[pip3] torch==1.9.0a0+gitd69c22d
[conda] Could not collect

all_reduce() also hits this issue:

10.4.22.101: ip-10-4-22-101:744:3711 [6] NCCL INFO Channel 01 : 14[a01c0] -> 15[a01d0] via P2P/IPC/read
10.4.22.101:     self.overflow = self.overflow_checker.check_using_norm(norm_groups)
10.4.22.101:   File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/utils.py", line 102, in check_using_norm
10.4.22.101:     dist.all_reduce(cuda_overflow, op=torch.distributed.ReduceOp.MAX)
10.4.22.101:   File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1206, in all_reduce
10.4.22.101:     work = default_pg.allreduce([tensor], opts)
10.4.22.101: RuntimeError: NCCL error in: /tmp/pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 21.0.3

Finally, we found that the AWS EFA NCCL plugin (AWS-OFI-NCCL) doesn't support NCCL v2.10.3 yet. It shows this error:

10.4.2.84: ip-10-4-2-84:853:853 [6] find_ofi_provider:543 NCCL WARN NET/OFI Couldn't find any optimal provider
10.4.2.84: ip-10-4-2-84:848:848 [1] NCCL INFO NET/IB : No device found.
10.4.2.84: ip-10-4-2-84:848:848 [1] NCCL INFO NET/Socket : Using [0]eth0:10.4.2.84<0> [1]eth1:10.4.8.96<0> [2]eth2:10.4.28.87<0> [3]eth3:10.4.18.220<0>
10.4.2.84: ip-10-4-2-84:848:848 [1] NCCL INFO Using network Socket
10.4.2.84: ip-10-4-2-84:852:852 [5] NCCL INFO NET/IB : No device found.
10.4.2.84: ip-10-4-2-84:852:852 [5] NCCL INFO NET/Socket : Using [0]eth0:10.4.2.84<0> [1]eth1:10.4.8.96<0> [2]eth2:10.4.28.87<0> [3]eth3:10.4.18.220<0>
10.4.2.84: ip-10-4-2-84:852:852 [5] NCCL INFO Using network Socket
10.4.2.84: ip-10-4-2-84:851:851 [4] NCCL INFO NET/IB : No device found.
10.4.2.84: ip-10-4-2-84:851:851 [4] NCCL INFO NET/Socket : Using [0]eth0:10.4.2.84<0> [1]eth1:10.4.8.96<0> [2]eth2:10.4.28.87<0> [3]eth3:10.4.18.220<0>
10.4.2.84: ip-10-4-2-84:851:851 [4] NCCL INFO Using network Socket
10.4.2.84: ip-10-4-2-84:853:853 [6] NCCL INFO NET/IB : No device found.
10.4.2.84: ip-10-4-2-84:853:853 [6] NCCL INFO NET/Socket : Using [0]eth0:10.4.2.84<0> [1]eth1:10.4.8.96<0> [2]eth2:10.4.28.87<0> [3]eth3:10.4.18.220<0>
10.4.2.84: ip-10-4-2-84:853:853 [6] NCCL INFO Using network Socket
10.4.22.101: ip-10-4-22-101:743:743 [5] NCCL INFO Bootstrap : Using eth0:10.4.22.101<0>
10.4.22.101: ip-10-4-22-101:740:740 [2] NCCL INFO Bootstrap : Using eth0:10.4.22.101<0>

The following combination makes allreduce() work, but alltoall() still fails:

NCCL v2.7.8 + PyTorch v1.9.0 + CUDA 11.0

NCCL_SOCKET_IFNAME is set to "eth", but our EFA instances have four Ethernet interfaces (see the note after the interface listing below):

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.4.2.84  netmask 255.255.224.0  broadcast 10.4.31.255
        inet6 fe80::437:f3ff:fe3a:8529  prefixlen 64  scopeid 0x20<link>
        ether 06:37:f3:3a:85:29  txqueuelen 1000  (Ethernet)
        RX packets 39803458  bytes 99075257349 (92.2 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 19278512  bytes 35978106875 (33.5 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.4.8.96  netmask 255.255.224.0  broadcast 10.4.31.255
        inet6 fe80::4a0:51ff:fecd:2c15  prefixlen 64  scopeid 0x20<link>
        ether 06:a0:51:cd:2c:15  txqueuelen 1000  (Ethernet)
        RX packets 325113  bytes 2433198989 (2.2 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 105789  bytes 7092566 (6.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.4.28.87  netmask 255.255.224.0  broadcast 10.4.31.255
        inet6 fe80::4de:c7ff:fe3b:1595  prefixlen 64  scopeid 0x20<link>
        ether 06:de:c7:3b:15:95  txqueuelen 1000  (Ethernet)
        RX packets 863069  bytes 6834097406 (6.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 298325  bytes 19788074 (18.8 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.4.18.220  netmask 255.255.224.0  broadcast 10.4.31.255
        inet6 fe80::461:3dff:fead:19bd  prefixlen 64  scopeid 0x20<link>
        ether 06:61:3d:ad:19:bd  txqueuelen 1000  (Ethernet)
        RX packets 860674  bytes 6832104026 (6.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 293308  bytes 19451440 (18.5 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
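
Since NCCL_SOCKET_IFNAME is a prefix match, "eth" matches all four interfaces above. For debugging, one way to pin NCCL's bootstrap/socket traffic to a single NIC (the interface name below is just an example) is to set the variable before the process group is created:

# Pin NCCL socket/bootstrap traffic to one interface for debugging.
# Must run before init_process_group; "eth0" is an example interface name.
import os
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
os.environ["NCCL_DEBUG"] = "INFO"        # optional: verbose NCCL logging

import torch.distributed as dist
dist.init_process_group(backend="nccl")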

I also tried NCCL v2.9.9 + PyTorch v1.9.0 + CUDA 11.0 + AWS-OFI-NCCL (aws branch); the alltoall() operation still failed:

10.4.22.101:   File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 310, in forward
10.4.22.101:     expert_inputs = self.Shuffle(torch.cat(route_inputs))
10.4.22.101:   File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 508, in Shuffle
10.4.22.101:     return _Shuffle.apply(x)
10.4.22.101:   File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 398, in forward
10.4.22.101:     return _shuffle(input_)
10.4.22.101:   File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 202, in _shuffle
10.4.22.101:     output_tensor_list, input_tensor_list, mpu.get_data_parallel_group()
10.4.22.101:   File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 2478, in all_to_all
10.4.22.101:     work = group.alltoall(output_tensor_list, input_tensor_list, opts)
10.4.22.101: RuntimeError: NCCL error in: /tmp/pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:38, unhandled system error, NCCL version 20.9.9

We ran nccl-tests (https://github.com/NVIDIA/nccl-tests) and reproduced the failure with alltoall():

Starting a DeepSpeed Training
+ cd /fsx/hchaoyan/m5/nccl-tests
++ which mpirun
+ /usr/local/mpi/bin/mpirun -allow-run-as-root --mca plm_rsh_no_tree_spawn 1 -x FI_PROVIDER=efa -x NCCL_SOCKET_IFNAME=eth -x FI_EFA_USE_DEVICE_RDMA=1 -x RDMAV_FORK_SAFE=1 -x LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/opt/aws-ofi-nccl/lib:/usr/lib:/usr/local/lib:/usr/local/lib:/usr/local/mpi/lib:/usr/local/mpi/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 -x NCCL_DEBUG=WARN -bind-to none -x NCCL_MIN_NCHANNELS=8 -x NCCL_ALGO=Ring -x OMP_NUM_THREADS=8 -x NCCL_NSOCKS_PERTHREAD=8 -x NCCL_SOCKET_NTHREADS=8 -n 16 -N 8 --mca pml '^cm' --hostfile /job/hostfile -mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 ./build/alltoall_perf -b 0.5G -e 2G -f 2 -g 1 -c 1 -n 10
Warning: Permanently added '[10.4.22.101]:2022' (RSA) to the list of known hosts.
# nThread 1 nGpus 1 minBytes 536870912 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 10 validation: 1 
#
# Using devices
#   Rank  0 Pid    311 on ip-10-4-2-84 device  0 [0x10] A100-SXM4-40GB
#   Rank  1 Pid    312 on ip-10-4-2-84 device  1 [0x10] A100-SXM4-40GB
#   Rank  2 Pid    313 on ip-10-4-2-84 device  2 [0x20] A100-SXM4-40GB
#   Rank  3 Pid    314 on ip-10-4-2-84 device  3 [0x20] A100-SXM4-40GB
#   Rank  4 Pid    315 on ip-10-4-2-84 device  4 [0x90] A100-SXM4-40GB
#   Rank  5 Pid    316 on ip-10-4-2-84 device  5 [0x90] A100-SXM4-40GB
#   Rank  6 Pid    319 on ip-10-4-2-84 device  6 [0xa0] A100-SXM4-40GB
#   Rank  7 Pid    321 on ip-10-4-2-84 device  7 [0xa0] A100-SXM4-40GB
#   Rank  8 Pid    286 on ip-10-4-22-101 device  0 [0x10] A100-SXM4-40GB
#   Rank  9 Pid    287 on ip-10-4-22-101 device  1 [0x10] A100-SXM4-40GB
#   Rank 10 Pid    288 on ip-10-4-22-101 device  2 [0x20] A100-SXM4-40GB
#   Rank 11 Pid    289 on ip-10-4-22-101 device  3 [0x20] A100-SXM4-40GB
#   Rank 12 Pid    290 on ip-10-4-22-101 device  4 [0x90] A100-SXM4-40GB
#   Rank 13 Pid    291 on ip-10-4-22-101 device  5 [0x90] A100-SXM4-40GB
#   Rank 14 Pid    292 on ip-10-4-22-101 device  6 [0xa0] A100-SXM4-40GB
#   Rank 15 Pid    296 on ip-10-4-22-101 device  7 [0xa0] A100-SXM4-40GB
NCCL version 2.9.9+cuda11.0
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

ip-10-4-22-101:286:351 [0] transport/net_socket.cc:332 NCCL WARN Call to accept failed : Too many open files
ip-10-4-22-101: Test NCCL failure alltoall.cu:76 'unhandled system error'
 .. ip-10-4-22-101 pid 286: Test failure common.cu:505
 .. ip-10-4-22-101 pid 286: Test failure common.cu:694
 .. ip-10-4-22-101 pid 286: Test failure alltoall.cu:111
 .. ip-10-4-22-101 pid 286: Test failure common.cu:722
 .. ip-10-4-22-101 pid 286: Test failure common.cu:1083
 .. ip-10-4-22-101 pid 286: Test failure common.cu:925
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[52734,1],8]
  Exit code:    3
--------------------------------------------------------------------------
+ '[' 3 -eq 0 ']'
+ log 'Writing exit code 1  to /tmp/batch-exit-code and shutting down supervisord'
+ echo 'mpi-run.sh - Writing exit code 1  to /tmp/batch-exit-code and shutting down supervisord'
mpi-run.sh - Writing exit code 1  to /tmp/batch-exit-code and shutting down supervisord
+ echo 1
++ cat /tmp/supervisord.pid
+ kill 7
+ exit 0

We finally solved this problem by enlarging the "ulimit -n" (max open file descriptors) value when launching the Docker containers.
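
For reference, the "Too many open files" warning from transport/net_socket.cc is exactly a per-process file-descriptor limit being hit; we raised the limit at container launch (e.g. via docker run --ulimit nofile=...). A small in-process check of the current limits (a sketch, with 65536 as an example target rather than the exact value we used) looks like:

# Check (and, if the hard limit allows, raise) the open-file-descriptor limit
# inside the training process. If the hard limit itself is too small, it has to
# be raised when the container is launched instead.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft} hard={hard}")

target = 65536                       # example value, not the exact one we used
if soft < target <= hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))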
