Wrong GEMM kernel called on ARM-based machine

Hello, I’m running Megatron-LM on my ARM-based machine with four A100 cards, but the performance is not as good as on an x86-based machine. After collecting performance data with NVIDIA Nsight Systems (nsys), I found that a cuBLAS GEMM kernel is used on ARM while a CUTLASS GEMM kernel is used on x86. My PyTorch version is 1.11 and I don’t know why this weird thing happens. Can anybody help?
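A stripped-down way to check which GEMM kernel gets picked without running Megatron-LM (the 4096x4096 fp16 matmul is only a stand-in, not the exact layer):

import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in fp16 matmul, roughly in the shape range of a Megatron-LM linear layer.
a = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
b = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Warm up so one-time heuristics/autotuning do not show up in the trace.
for _ in range(10):
    torch.matmul(a, b)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    torch.matmul(a, b)
    torch.cuda.synchronize()

# The kernel name in the table shows whether a cutlass_* kernel or another
# cuBLAS kernel was dispatched.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))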

Best Wishes.

Could you explain your setup a bit more and in particular how you’ve built PyTorch?
The pip wheels and conda binaries do not support ARM nodes with GPUs, so I guess you have built PyTorch from source?
If so, which CUDA toolkit did you use?

Thanks for your quick reply. I’m using CUDA 11.3. This is my PyTorch source-build script:

echo "prepare Environment"
git clone --depth 1 -b v1.11.0 https://github.com/pytorch/pytorch.git pytorch-1.11
    bc2c6edaf163b1a1330e37a6e34caf8c553e4755
cd pytorch
git submodule sync
git submodule update --init --recursive --jobs 0 2>&1 | tee git-submodule.log
conda create --name pytorch1.11 python=3.8
conda install -y pyyaml typing_extensions numpy ccache 
module use /home/share/apps/modulefiles
module load anaconda/2021.11
source activate pytorch1.11
echo "start Build"
module use /home/share/apps/modulefiles/
module load compilers/cuda/11.3.0
module load compilers/gcc/9.3.1
module load cudnn/8.2.1_cuda11.3
module load libs/nccl/2.17.1-1_cuda11.3

python3 setup.py install 2>&1 | tee compile.log
python3 setup.py sdist bdist_wheel 2>&1| tee gen-whl.log
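
After the build I sanity-check the result like this (a quick check, not part of the build script itself):

import torch

print(torch.__version__)               # should report 1.11.0a0+gitbc2c6ed
print(torch.version.cuda)              # CUDA toolkit used for the build (11.3)
print(torch.backends.cudnn.version())  # cuDNN picked up at build time
print(torch.cuda.get_device_name(0))   # e.g. NVIDIA A100-PCIE-40GB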

Could you post the outputs of python -m torch.utils.collect_env in both cases and confirm the same cuDNN version is used?

The cuDNN version is 8.2 on both platforms.
X86-ENV:

Collecting environment information...
PyTorch version: 1.11.0a0+gitbc2c6ed
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 9.3.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.18

Python version: 3.8.16 (default, Mar  2 2023, 03:21:46)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-XXX
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration:
GPU 0: NVIDIA A100-PCIE-40GB
GPU 1: NVIDIA A100-PCIE-40GB
GPU 2: NVIDIA A100-PCIE-40GB
GPU 3: NVIDIA A100-PCIE-40GB
GPU 4: NVIDIA A100-PCIE-40GB
GPU 5: NVIDIA A100-PCIE-40GB
GPU 6: NVIDIA A100-PCIE-40GB
GPU 7: NVIDIA A100-PCIE-40GB

Nvidia driver version: 510.47.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.24.2
[pip3] torch==1.11.0a0+gitbc2c6ed
[pip3] torchvision==0.12.0a0+9b5a3fe
[conda] blas                      1.0                         mkl    defaults
[conda] mkl                       2021.4.0           h06a4308_640    defaults
[conda] mkl-service               2.4.0            py38h7f8727e_0    defaults
[conda] mkl_fft                   1.3.1            py38hd3c417c_0    defaults
[conda] mkl_random                1.2.2            py38h51133e4_0    defaults
[conda] numpy                     1.24.2                   pypi_0    pypi
[conda] numpy-base                1.23.5           py38h31eccc5_0    defaults
[conda] torch                     1.11.0a0+gitbc2c6ed          pypi_0    pypi
[conda] torchvision               0.12.0a0+9b5a3fe          pypi_0    pypi

ARM-ENV:

Collecting environment information...
PyTorch version: 1.11.0a0+gitbc2c6ed
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Kylin Linux Advanced Server V10 (Sword) (aarch64)
GCC version: (GCC) 9.3.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.28

Python version: 3.8.16 (default, Mar  2 2023, 03:16:31)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.19.90-XXX
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration:
GPU 0: NVIDIA A100-PCIE-40GB
GPU 1: NVIDIA A100-PCIE-40GB
GPU 2: NVIDIA A100-PCIE-40GB
GPU 3: NVIDIA A100-PCIE-40GB

Nvidia driver version: 510.85.02
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.2.4
/usr/lib64/libcudnn_adv_infer.so.8.2.4
/usr/lib64/libcudnn_adv_train.so.8.2.4
/usr/lib64/libcudnn_cnn_infer.so.8.2.4
/usr/lib64/libcudnn_cnn_train.so.8.2.4
/usr/lib64/libcudnn_ops_infer.so.8.2.4
/usr/lib64/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==1.11.0a0+gitbc2c6ed
[pip3] torchvision==0.12.0a0+9b5a3fe
[conda] numpy                     1.23.5           py38h8708280_0    http://mirrors.bfsu.edu.cn/anaconda/pkgs/main
[conda] numpy-base                1.23.5           py38h4a83355_0    http://mirrors.bfsu.edu.cn/anaconda/pkgs/main

Could you post the layer and input causing the different matmul kernel calls?
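If it's hard to pull the exact layer out of the training script, a sketch along these lines could record the operator input shapes during one iteration (run_one_training_step is a placeholder for a single Megatron-LM step, not a real function):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    run_one_training_step()  # placeholder for one Megatron-LM iteration

# Group by operator + input shapes to see which matmul sizes hit the slow kernel.
print(prof.key_averages(group_by_input_shape=True)
          .table(sort_by="cuda_time_total", row_limit=20))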

Thanks for your answer. The problem was solved after upgrading cuBLAS; I guess this was a bug inside cuBLAS.
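
In case it helps others, the cuBLAS version the process actually loads can be checked with a small ctypes sketch (assuming a CUDA 11.x libcublas.so.11 on the library path):

import ctypes
import torch  # import torch first so the libcublas it links is most likely already loaded

libcublas = ctypes.CDLL("libcublas.so.11")

# cublasGetProperty(libraryPropertyType, int*) with MAJOR_VERSION=0,
# MINOR_VERSION=1, PATCH_LEVEL=2 (from library_types.h).
major, minor, patch = ctypes.c_int(), ctypes.c_int(), ctypes.c_int()
libcublas.cublasGetProperty(0, ctypes.byref(major))
libcublas.cublasGetProperty(1, ctypes.byref(minor))
libcublas.cublasGetProperty(2, ctypes.byref(patch))
print(f"cuBLAS {major.value}.{minor.value}.{patch.value}")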