Wrong GEMM kernel called on ARM-based machine

Hello, I’m running Megatron-LM on my ARM-based machine with four A100 cards, but the performance is not as good as on an x86-based machine. After collecting performance data with NVIDIA Nsight Systems (nsys), I found that a cuBLAS GEMM kernel is used on ARM while a CUTLASS GEMM kernel is used on x86. My PyTorch version is 1.11 and I don’t know why this weird thing happens. Can anybody help?
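A stripped-down way to check which GEMM kernel gets picked without running Megatron-LM (the 4096x4096 fp16 matmul is only a stand-in, not the exact layer):

import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in fp16 matmul, roughly in the shape range of a Megatron-LM linear layer.
a = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
b = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Warm up so one-time heuristics/autotuning do not show up in the trace.
for _ in range(10):
    torch.matmul(a, b)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    torch.matmul(a, b)
    torch.cuda.synchronize()

# The kernel name in the table shows whether a cutlass_* kernel or another
# cuBLAS kernel was dispatched.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))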

Best Wishes.

Could you explain your setup a bit more and in particular how you’ve built PyTorch?
The pip wheels and conda binaries do not support ARM nodes with GPUs, so I guess you have built PyTorch from source?
If so, which CUDA toolkit did you use?

Thanks for your quick reply. I’m using CUDA 11.3. This is my PyTorch source-build script:

echo "prepare Environment"
git clone --depth 1 -b v1.11.0 https://github.com/pytorch/pytorch.git pytorch-1.11
    bc2c6edaf163b1a1330e37a6e34caf8c553e4755
cd pytorch
git submodule sync
git submodule update --init --recursive --jobs 0 2>&1 | tee git-submodule.log
conda create --name pytorch1.11 python=3.8
conda install -y pyyaml typing_extensions numpy ccache 
module use /home/share/apps/modulefiles
module load anaconda/2021.11
source activate pytorch1.11
echo "start Build"
module use /home/share/apps/modulefiles/
module load compilers/cuda/11.3.0
module load compilers/gcc/9.3.1
module load cudnn/8.2.1_cuda11.3
module load libs/nccl/2.17.1-1_cuda11.3

python3 setup.py install 2>&1 | tee compile.log
python3 setup.py sdist bdist_wheel 2>&1| tee gen-whl.log
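
After the build I sanity-check the result like this (a quick check, not part of the build script itself):

import torch

print(torch.__version__)               # should report 1.11.0a0+gitbc2c6ed
print(torch.version.cuda)              # CUDA toolkit used for the build (11.3)
print(torch.backends.cudnn.version())  # cuDNN picked up at build time
print(torch.cuda.get_device_name(0))   # e.g. NVIDIA A100-PCIE-40GB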

Could you post the outputs of python -m torch.utils.collect_env in both cases and confirm the same cuDNN version is used?

The cuDNN version is 8.2 on both platforms.
X86-ENV:

Collecting environment information...
PyTorch version: 1.11.0a0+gitbc2c6ed
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 9.3.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.18

Python version: 3.8.16 (default, Mar  2 2023, 03:21:46)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-XXX
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration:
GPU 0: NVIDIA A100-PCIE-40GB
GPU 1: NVIDIA A100-PCIE-40GB
GPU 2: NVIDIA A100-PCIE-40GB
GPU 3: NVIDIA A100-PCIE-40GB
GPU 4: NVIDIA A100-PCIE-40GB
GPU 5: NVIDIA A100-PCIE-40GB
GPU 6: NVIDIA A100-PCIE-40GB
GPU 7: NVIDIA A100-PCIE-40GB

Nvidia driver version: 510.47.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.24.2
[pip3] torch==1.11.0a0+gitbc2c6ed
[pip3] torchvision==0.12.0a0+9b5a3fe
[conda] blas                      1.0                         mkl    defaults
[conda] mkl                       2021.4.0           h06a4308_640    defaults
[conda] mkl-service               2.4.0            py38h7f8727e_0    defaults
[conda] mkl_fft                   1.3.1            py38hd3c417c_0    defaults
[conda] mkl_random                1.2.2            py38h51133e4_0    defaults
[conda] numpy                     1.24.2                   pypi_0    pypi
[conda] numpy-base                1.23.5           py38h31eccc5_0    defaults
[conda] torch                     1.11.0a0+gitbc2c6ed          pypi_0    pypi
[conda] torchvision               0.12.0a0+9b5a3fe          pypi_0    pypi

ARM-ENV:

Collecting environment information...
PyTorch version: 1.11.0a0+gitbc2c6ed
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Kylin Linux Advanced Server V10 (Sword) (aarch64)
GCC version: (GCC) 9.3.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.28

Python version: 3.8.16 (default, Mar  2 2023, 03:16:31)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.19.90-XXX
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration:
GPU 0: NVIDIA A100-PCIE-40GB
GPU 1: NVIDIA A100-PCIE-40GB
GPU 2: NVIDIA A100-PCIE-40GB
GPU 3: NVIDIA A100-PCIE-40GB

Nvidia driver version: 510.85.02
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.2.4
/usr/lib64/libcudnn_adv_infer.so.8.2.4
/usr/lib64/libcudnn_adv_train.so.8.2.4
/usr/lib64/libcudnn_cnn_infer.so.8.2.4
/usr/lib64/libcudnn_cnn_train.so.8.2.4
/usr/lib64/libcudnn_ops_infer.so.8.2.4
/usr/lib64/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==1.11.0a0+gitbc2c6ed
[pip3] torchvision==0.12.0a0+9b5a3fe
[conda] numpy                     1.23.5           py38h8708280_0    http://mirrors.bfsu.edu.cn/anaconda/pkgs/main
[conda] numpy-base                1.23.5           py38h4a83355_0    http://mirrors.bfsu.edu.cn/anaconda/pkgs/main

Could you post the layer and input causing the different matmul kernel calls?
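If it's hard to pull the exact layer out of the training script, a sketch along these lines could record the operator input shapes during one iteration (run_one_training_step is a placeholder for a single Megatron-LM step, not a real function):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    run_one_training_step()  # placeholder for one Megatron-LM iteration

# Group by operator + input shapes to see which matmul sizes hit the slow kernel.
print(prof.key_averages(group_by_input_shape=True)
          .table(sort_by="cuda_time_total", row_limit=20))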

Thanks for your answer. The problem was solved after upgrading cuBLAS; I guess this was a bug inside cuBLAS.
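
In case it helps others, the cuBLAS version the process actually loads can be checked with a small ctypes sketch (assuming a CUDA 11.x libcublas.so.11 on the library path):

import ctypes
import torch  # import torch first so the libcublas it links is most likely already loaded

libcublas = ctypes.CDLL("libcublas.so.11")

# cublasGetProperty(libraryPropertyType, int*) with MAJOR_VERSION=0,
# MINOR_VERSION=1, PATCH_LEVEL=2 (from library_types.h).
major, minor, patch = ctypes.c_int(), ctypes.c_int(), ctypes.c_int()
libcublas.cublasGetProperty(0, ctypes.byref(major))
libcublas.cublasGetProperty(1, ctypes.byref(minor))
libcublas.cublasGetProperty(2, ctypes.byref(patch))
print(f"cuBLAS {major.value}.{minor.value}.{patch.value}")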