I compiled PyTorch v1.9.0 with CUDA 11.0 and NCCL 2.10.3.
Upgrade NCCL to 2.10.3:
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub && add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /" && apt update && apt install -y --allow-change-held-packages libnccl2=2.10.3-1+cuda11.0 libnccl-dev=2.10.3-1+cuda11.0
Compile PyTorch v1.9.0 from source:
git clone https://github.com/pytorch/pytorch && cd pytorch && git checkout v1.9.0 && git submodule sync && git submodule update --init --recursive && sudo USE_SYSTEM_NCCL=1 TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0" python3 setup.py install
Compile AWS-OFI-NCCL to support AWS EFA (fast cross-machine communication on AWS):
git clone https://github.com/aws/aws-ofi-nccl.git $HOME/aws-ofi-nccl \
&& cd $HOME/aws-ofi-nccl \
&& git checkout aws \
&& ./autogen.sh \
&& ./configure --prefix=$HOME/aws-ofi-nccl/install \
--with-libfabric=/opt/amazon/efa/ \
--with-cuda=/usr/local/cuda \
--with-nccl=/tmp/pytorch/build/nccl \
--with-mpi=/opt/amazon/openmpi/ \
&& make -j$(nproc) && make install
I got the following error when calling the all_to_all API:
10.4.22.101: File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 310, in forward
10.4.22.101: expert_inputs = self.Shuffle(torch.cat(route_inputs))
10.4.22.101: File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 508, in Shuffle
10.4.22.101: return _Shuffle.apply(x)
10.4.22.101: File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 398, in forward
10.4.22.101: return _shuffle(input_)
10.4.22.101: File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 202, in _shuffle
10.4.22.101: output_tensor_list, input_tensor_list, mpu.get_data_parallel_group()
10.4.22.101: File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 2478, in all_to_all
10.4.22.101: work = group.alltoall(output_tensor_list, input_tensor_list, opts)
10.4.22.101: RuntimeError: NCCL error in: /tmp/pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:38, unhandled system error, NCCL version 21.0.3
10.4.22.101: ncclSystemError: System call (socket, malloc, munmap, etc) failed.
chaoyanghe
(Chaoyang He)
2
The environment is as follows:
------------nvidia-smi------------
Mon Aug 9 22:18:40 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:16.0 Off | 0 |
| N/A 43C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:17.0 Off | 0 |
| N/A 42C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:18.0 Off | 0 |
| N/A 42C P0 45W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:19.0 Off | 0 |
| N/A 44C P0 45W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000000:00:1A.0 Off | 0 |
| N/A 43C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 43C P0 44W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 42C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 45C P0 44W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
------------python3 --version------------
Python 3.6.9
------------nvcc --version------------
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
------------python3 -c import torch; print(torch.__version__)------------
1.9.0a0+gitd69c22d
2021-08-09 22:18:41,156 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
------------python3 -c import torch;print(torch.cuda.nccl.version())------------
3003
------------collect environment------------
--2021-08-09 22:18:41-- https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16993 (17K) [text/plain]
Saving to: 'collect_env.py'
0K .......... ...... 100% 64.9M=0s
2021-08-09 22:18:41 (64.9 MB/s) - 'collect_env.py' saved [16993/16993]
Collecting environment information...
PyTorch version: 1.9.0a0+gitd69c22d
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.21.1
Libc version: glibc-2.25
Python version: 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-4.14.200-155.322.amzn2.x86_64-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.0.221
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB
GPU 5: Tesla V100-SXM2-32GB
GPU 6: Tesla V100-SXM2-32GB
GPU 7: Tesla V100-SXM2-32GB
Nvidia driver version: 450.80.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] pytorch-ignite==0.4.6
[pip3] torch==1.9.0a0+gitd69c22d
[conda] Could not collect
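A note on the version numbers in these logs: the error message reports "NCCL version 21.0.3" and torch.cuda.nccl.version() prints 3003, although NCCL 2.10.3 was installed. The thread does not explain this. The sketch below shows one reading that is consistent with the numbers: NCCL's header encodes versions from 2.9 onward as major * 10000 + minor * 100 + patch, and the assumption here is that this PyTorch build still splits or packs the value with the older major * 1000 layout. The function names are made up for illustration.

```python
def nccl_version_code(major, minor, patch):
    """Integer version code following the NCCL_VERSION macro in nccl.h."""
    if major <= 2 and minor <= 8:
        return major * 1000 + minor * 100 + patch
    return major * 10000 + minor * 100 + patch


def decode_with_old_layout(code):
    """Split a version code as if it used the pre-2.9 layout (assumption)."""
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def pack_with_old_layout(major, minor, patch):
    """Pack a version the pre-2.9 way, regardless of the actual version (assumption)."""
    return major * 1000 + minor * 100 + patch


code_2_10_3 = nccl_version_code(2, 10, 3)            # 21003
misread = decode_with_old_layout(code_2_10_3)        # (21, 0, 3), printed as "21.0.3"
packed = pack_with_old_layout(2, 10, 3)              # 3003

print(code_2_10_3, misread, packed)
```

Under the same assumption, NCCL 2.9.9 has the code 20909 and would be printed as "20.9.9". If this reading is right, the odd version strings are a display quirk and not a sign that the wrong NCCL library was loaded.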
chaoyanghe
(Chaoyang He)
3
all_reduce has the same issue:
10.4.22.101: ip-10-4-22-101:744:3711 [6] NCCL INFO Channel 01 : 14[a01c0] -> 15[a01d0] via P2P/IPC/read
10.4.22.101: self.overflow = self.overflow_checker.check_using_norm(norm_groups)
10.4.22.101: File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/utils.py", line 102, in check_using_norm
10.4.22.101: dist.all_reduce(cuda_overflow, op=torch.distributed.ReduceOp.MAX)
10.4.22.101: File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 1206, in all_reduce
10.4.22.101: work = default_pg.allreduce([tensor], opts)
10.4.22.101: RuntimeError: NCCL error in: /tmp/pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 21.0.3
chaoyanghe
(Chaoyang He)
4
Finally, we found that the AWS EFA NCCL library does not support NCCL v2.10.3 yet. The log shows that no OFI provider is found and that NCCL falls back to the Socket transport:
10.4.2.84: ip-10-4-2-84:853:853 [6] find_ofi_provider:543 NCCL WARN NET/OFI Couldn't find any optimal provider
10.4.2.84: ip-10-4-2-84:848:848 [1] NCCL INFO NET/IB : No device found.
10.4.2.84: ip-10-4-2-84:848:848 [1] NCCL INFO NET/Socket : Using [0]eth0:10.4.2.84<0> [1]eth1:10.4.8.96<0> [2]eth2:10.4.28.87<0> [3]eth3:10.4.18.220<0>
10.4.2.84: ip-10-4-2-84:848:848 [1] NCCL INFO Using network Socket
10.4.2.84: ip-10-4-2-84:852:852 [5] NCCL INFO NET/IB : No device found.
10.4.2.84: ip-10-4-2-84:852:852 [5] NCCL INFO NET/Socket : Using [0]eth0:10.4.2.84<0> [1]eth1:10.4.8.96<0> [2]eth2:10.4.28.87<0> [3]eth3:10.4.18.220<0>
10.4.2.84: ip-10-4-2-84:852:852 [5] NCCL INFO Using network Socket
10.4.2.84: ip-10-4-2-84:851:851 [4] NCCL INFO NET/IB : No device found.
10.4.2.84: ip-10-4-2-84:851:851 [4] NCCL INFO NET/Socket : Using [0]eth0:10.4.2.84<0> [1]eth1:10.4.8.96<0> [2]eth2:10.4.28.87<0> [3]eth3:10.4.18.220<0>
10.4.2.84: ip-10-4-2-84:851:851 [4] NCCL INFO Using network Socket
10.4.2.84: ip-10-4-2-84:853:853 [6] NCCL INFO NET/IB : No device found.
10.4.2.84: ip-10-4-2-84:853:853 [6] NCCL INFO NET/Socket : Using [0]eth0:10.4.2.84<0> [1]eth1:10.4.8.96<0> [2]eth2:10.4.28.87<0> [3]eth3:10.4.18.220<0>
10.4.2.84: ip-10-4-2-84:853:853 [6] NCCL INFO Using network Socket
10.4.22.101: ip-10-4-22-101:743:743 [5] NCCL INFO Bootstrap : Using eth0:10.4.22.101<0>
10.4.22.101: ip-10-4-22-101:740:740 [2] NCCL INFO Bootstrap : Using eth0:10.4.22.101<0>
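With many ranks the relevant lines are easy to miss. As a rough sketch, a captured NCCL debug log could be scanned for the transport NCCL reports and for OFI plugin warnings. The helper names are hypothetical, and the sample lines are copied from the log above.

```python
import re

# Matches lines such as "... NCCL INFO Using network Socket".
NETWORK_RE = re.compile(r"NCCL INFO Using network (.+)")


def selected_networks(log_text):
    """Return the set of network names that NCCL reported selecting."""
    return {name.strip() for name in NETWORK_RE.findall(log_text)}


def ofi_warnings(log_text):
    """Return the log lines in which the OFI network plugin raised a warning."""
    return [line for line in log_text.splitlines() if "NCCL WARN NET/OFI" in line]


sample_log = """\
10.4.2.84: ip-10-4-2-84:853:853 [6] find_ofi_provider:543 NCCL WARN NET/OFI Couldn't find any optimal provider
10.4.2.84: ip-10-4-2-84:848:848 [1] NCCL INFO NET/IB : No device found.
10.4.2.84: ip-10-4-2-84:848:848 [1] NCCL INFO Using network Socket
10.4.2.84: ip-10-4-2-84:852:852 [5] NCCL INFO Using network Socket
"""

print(selected_networks(sample_log))
print(len(ofi_warnings(sample_log)))
```

For the sample above, the only selected network is Socket and there is one OFI warning, which is the pattern that indicates EFA is not being used.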
chaoyanghe
(Chaoyang He)
5
The following combination makes allreduce() work, but alltoall() still fails:
NCCL v2.7.8 + PyTorch v1.9.0 + CUDA 11.0
NCCL_SOCKET_IFNAME is set to "eth", but our EFA instance has four eth interfaces:
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 10.4.2.84 netmask 255.255.224.0 broadcast 10.4.31.255
inet6 fe80::437:f3ff:fe3a:8529 prefixlen 64 scopeid 0x20<link>
ether 06:37:f3:3a:85:29 txqueuelen 1000 (Ethernet)
RX packets 39803458 bytes 99075257349 (92.2 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 19278512 bytes 35978106875 (33.5 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 10.4.8.96 netmask 255.255.224.0 broadcast 10.4.31.255
inet6 fe80::4a0:51ff:fecd:2c15 prefixlen 64 scopeid 0x20<link>
ether 06:a0:51:cd:2c:15 txqueuelen 1000 (Ethernet)
RX packets 325113 bytes 2433198989 (2.2 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 105789 bytes 7092566 (6.7 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 10.4.28.87 netmask 255.255.224.0 broadcast 10.4.31.255
inet6 fe80::4de:c7ff:fe3b:1595 prefixlen 64 scopeid 0x20<link>
ether 06:de:c7:3b:15:95 txqueuelen 1000 (Ethernet)
RX packets 863069 bytes 6834097406 (6.3 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 298325 bytes 19788074 (18.8 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 10.4.18.220 netmask 255.255.224.0 broadcast 10.4.31.255
inet6 fe80::461:3dff:fead:19bd prefixlen 64 scopeid 0x20<link>
ether 06:61:3d:ad:19:bd txqueuelen 1000 (Ethernet)
RX packets 860674 bytes 6832104026 (6.3 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 293308 bytes 19451440 (18.5 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
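According to the NCCL documentation, NCCL_SOCKET_IFNAME is a prefix filter: "eth" selects every interface whose name starts with eth, "=eth0" selects exactly eth0, and a leading "^" excludes instead of selecting. The sketch below is a simplified model of that rule, not NCCL's actual implementation, and the function name is invented for illustration. It shows why the setting "eth" picks up all four interfaces listed above.

```python
def select_interfaces(ifname_spec, interfaces):
    """Simplified model of how NCCL_SOCKET_IFNAME filters interface names."""
    exclude = ifname_spec.startswith("^")   # "^..." means: do not use these
    if exclude:
        ifname_spec = ifname_spec[1:]
    exact = ifname_spec.startswith("=")     # "=..." means: exact name match
    if exact:
        ifname_spec = ifname_spec[1:]
    patterns = [p for p in ifname_spec.split(",") if p]

    def matches(name):
        if exact:
            return name in patterns
        return any(name.startswith(p) for p in patterns)

    # Keep matching names normally, non-matching names when excluding.
    return [name for name in interfaces if matches(name) != exclude]


host_interfaces = ["lo", "docker0", "eth0", "eth1", "eth2", "eth3"]

print(select_interfaces("eth", host_interfaces))         # all four eth interfaces
print(select_interfaces("=eth0", host_interfaces))       # only eth0
print(select_interfaces("^lo,docker", host_interfaces))  # everything except lo/docker*
```

Pinning the value to "=eth0" would restrict the socket transport to a single interface; whether that helps in this setup is not established in the thread.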
chaoyanghe
(Chaoyang He)
6
I tried NCCL v2.9.9 + PyTorch v1.9.0 + CUDA 11.0 + AWS-OFI-NCCL (aws branch); the alltoall() operation still failed:
10.4.22.101: File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 310, in forward
10.4.22.101: expert_inputs = self.Shuffle(torch.cat(route_inputs))
10.4.22.101: File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 508, in Shuffle
10.4.22.101: return _Shuffle.apply(x)
10.4.22.101: File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 398, in forward
10.4.22.101: return _shuffle(input_)
10.4.22.101: File "/usr/local/lib/python3.6/dist-packages/m5_transformers/models/switch_transformers/switch_transformer_layers.py", line 202, in _shuffle
10.4.22.101: output_tensor_list, input_tensor_list, mpu.get_data_parallel_group()
10.4.22.101: File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 2478, in all_to_all
10.4.22.101: work = group.alltoall(output_tensor_list, input_tensor_list, opts)
10.4.22.101: RuntimeError: NCCL error in: /tmp/pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:38, unhandled system error, NCCL version 20.9.9
chaoyanghe
(Chaoyang He)
7
We ran NCCL Tests (https://github.com/NVIDIA/nccl-tests) and hit the following failure with alltoall_perf:
Starting a DeepSpeed Training
+ cd /fsx/hchaoyan/m5/nccl-tests
++ which mpirun
+ /usr/local/mpi/bin/mpirun -allow-run-as-root --mca plm_rsh_no_tree_spawn 1 -x FI_PROVIDER=efa -x NCCL_SOCKET_IFNAME=eth -x FI_EFA_USE_DEVICE_RDMA=1 -x RDMAV_FORK_SAFE=1 -x LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/opt/aws-ofi-nccl/lib:/usr/lib:/usr/local/lib:/usr/local/lib:/usr/local/mpi/lib:/usr/local/mpi/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 -x NCCL_DEBUG=WARN -bind-to none -x NCCL_MIN_NCHANNELS=8 -x NCCL_ALGO=Ring -x OMP_NUM_THREADS=8 -x NCCL_NSOCKS_PERTHREAD=8 -x NCCL_SOCKET_NTHREADS=8 -n 16 -N 8 --mca pml '^cm' --hostfile /job/hostfile -mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 ./build/alltoall_perf -b 0.5G -e 2G -f 2 -g 1 -c 1 -n 10
Warning: Permanently added '[10.4.22.101]:2022' (RSA) to the list of known hosts.
# nThread 1 nGpus 1 minBytes 536870912 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 10 validation: 1
#
# Using devices
# Rank 0 Pid 311 on ip-10-4-2-84 device 0 [0x10] A100-SXM4-40GB
# Rank 1 Pid 312 on ip-10-4-2-84 device 1 [0x10] A100-SXM4-40GB
# Rank 2 Pid 313 on ip-10-4-2-84 device 2 [0x20] A100-SXM4-40GB
# Rank 3 Pid 314 on ip-10-4-2-84 device 3 [0x20] A100-SXM4-40GB
# Rank 4 Pid 315 on ip-10-4-2-84 device 4 [0x90] A100-SXM4-40GB
# Rank 5 Pid 316 on ip-10-4-2-84 device 5 [0x90] A100-SXM4-40GB
# Rank 6 Pid 319 on ip-10-4-2-84 device 6 [0xa0] A100-SXM4-40GB
# Rank 7 Pid 321 on ip-10-4-2-84 device 7 [0xa0] A100-SXM4-40GB
# Rank 8 Pid 286 on ip-10-4-22-101 device 0 [0x10] A100-SXM4-40GB
# Rank 9 Pid 287 on ip-10-4-22-101 device 1 [0x10] A100-SXM4-40GB
# Rank 10 Pid 288 on ip-10-4-22-101 device 2 [0x20] A100-SXM4-40GB
# Rank 11 Pid 289 on ip-10-4-22-101 device 3 [0x20] A100-SXM4-40GB
# Rank 12 Pid 290 on ip-10-4-22-101 device 4 [0x90] A100-SXM4-40GB
# Rank 13 Pid 291 on ip-10-4-22-101 device 5 [0x90] A100-SXM4-40GB
# Rank 14 Pid 292 on ip-10-4-22-101 device 6 [0xa0] A100-SXM4-40GB
# Rank 15 Pid 296 on ip-10-4-22-101 device 7 [0xa0] A100-SXM4-40GB
NCCL version 2.9.9+cuda11.0
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
ip-10-4-22-101:286:351 [0] transport/net_socket.cc:332 NCCL WARN Call to accept failed : Too many open files
ip-10-4-22-101: Test NCCL failure alltoall.cu:76 'unhandled system error'
.. ip-10-4-22-101 pid 286: Test failure common.cu:505
.. ip-10-4-22-101 pid 286: Test failure common.cu:694
.. ip-10-4-22-101 pid 286: Test failure alltoall.cu:111
.. ip-10-4-22-101 pid 286: Test failure common.cu:722
.. ip-10-4-22-101 pid 286: Test failure common.cu:1083
.. ip-10-4-22-101 pid 286: Test failure common.cu:925
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[52734,1],8]
Exit code: 3
--------------------------------------------------------------------------
+ '[' 3 -eq 0 ']'
+ log 'Writing exit code 1 to /tmp/batch-exit-code and shutting down supervisord'
+ echo 'mpi-run.sh - Writing exit code 1 to /tmp/batch-exit-code and shutting down supervisord'
mpi-run.sh - Writing exit code 1 to /tmp/batch-exit-code and shutting down supervisord
+ echo 1
++ cat /tmp/supervisord.pid
+ kill 7
+ exit 0
chaoyanghe
(Chaoyang He)
8
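We finally solved this problem by raising the "ulimit -n" value (the per-process open-file limit) when launching the Docker containers. This matches the "Too many open files" warning in the nccl-tests output above.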
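The thread does not say how the limit was raised. One common way is Docker's --ulimit nofile=<soft>:<hard> option on docker run. As a complementary sketch, a training script could check the limit inside the container and raise the soft limit up to the hard limit before initializing the process group. The function name and the default target of 65536 are arbitrary choices for illustration.

```python
import resource


def raise_open_file_limit(target=65536):
    """Raise the soft RLIMIT_NOFILE toward `target`, capped by the hard limit.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)          # an unprivileged process cannot exceed hard
    if soft != resource.RLIM_INFINITY and soft < target:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            pass                            # keep the current limit if the OS refuses
    return resource.getrlimit(resource.RLIMIT_NOFILE)


limits_before = resource.getrlimit(resource.RLIMIT_NOFILE)
limits_after = raise_open_file_limit()
print("open-file limit (soft, hard):", limits_before, "->", limits_after)
```

This only raises the soft limit. If the hard limit set for the container is itself too low, it has to be raised from outside, for example when the container is launched.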