From issue #47702 on the PyTorch repository, it is not yet clear whether PyTorch already uses AMX on Apple silicon to accelerate computations. It might do this because it relies on the operating system’s BLAS library, which is Accelerate on macOS. For reasons not described here, Apple has released little documentation on the AMX ever since its debut in the A13 chip.
If PyTorch does already use AMX, then that is ~1.3 TFLOPS of processing power. For comparison, the M1 GPU has 2.6 TFLOPS. The issue linked above was raised partially because PyTorch lacked hardware acceleration on Apple devices for a very long time. If AMX is in fact used and has comparable performance to GPU acceleration, then many people might want to know.
Could anyone investigate whether the AMX is being used? You may need to learn a bit of Swift, which provides direct access to Accelerate and microsecond-level precision for profiling. Note that M1 has one AMX, while M1 Pro/Max has two. Here are some helpful links for anyone who wishes to investigate this:
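As a starting point (without needing Swift), a rough way to probe this is to measure the achieved GEMM throughput of whatever BLAS your Python stack links against and compare it with the ~1.3 TFLOPS figure quoted for the AMX. This is only a sketch: it assumes your NumPy build routes `@` through the system BLAS (Accelerate on many macOS builds), and the function name `gemm_gflops` is just for illustration.

```python
import time
import numpy as np

def gemm_gflops(n=1024, iters=5, dtype=np.float32):
    """Estimate achieved GFLOP/s for an n x n matmul.

    A dense n x n matmul costs about 2*n**3 floating-point
    operations, so the measured rate hints at whether the BLAS
    backend is reaching AMX-class throughput (~1300 GFLOP/s)
    or only ordinary NEON-core throughput.
    """
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    a @ b  # warm-up call, excluded from the timed loop
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    dt = time.perf_counter() - t0
    return 2.0 * n**3 * iters / dt / 1e9

print(f"{gemm_gflops():.1f} GFLOP/s")
```

If the number lands in the hundreds of GFLOP/s on a single process with low reported CPU utilization, that pattern is at least consistent with the AMX doing the work rather than the regular cores.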
This issue suggests that AMX is used as of the MPS nightlies:
opened 04:07AM - 20 May 22 UTC
closed 01:47PM - 20 May 22 UTC
### 🐛 Describe the bug
After updating to the nightly-build PyTorch 1.12, I ran a performance test comparing `'mps'` against `'cpu'`, shown below:
```python
import torch
from tqdm import trange

DTYPE = torch.float32
MAT_SIZE = 5000
DEVICE = ["cpu", "mps"][0]  # it's CPU now

mat = torch.randn([MAT_SIZE, MAT_SIZE], dtype=DTYPE, device=DEVICE)
for i in trange(N_ITER := 100):
    mat @= mat  # <--- Main Computation HERE
print(mat[0, 0], end="")  # avoid sync-issue when using 'mps'
```
It's true that `"mps"` is indeed faster than `"cpu"` on this M1 Pro chip.
**However**, I soon noticed that it's not utilizing all 10 CPU cores when `device="cpu"`.
Specifically, `Activity Monitor.app` shows that it only uses ≈200% of CPU.
After further experiments, I found some interesting facts:
1. As mentioned above, `device="cpu"` on version 1.12 does not use all CPU cores on the M1 Pro chip.
2. When switching back to version 1.11, `device="cpu"` **does** take advantage of all the CPU cores.
3. Although `2.` is true, 1.11 is actually slower than 1.12! i.e. `device="cpu"` on 1.12 uses less CPU and less power yet gets better performance.
4. Although `1.` is true, manually running N (such as 2) instances of this script drops the performance of each script to 1/N of its original (while indeed more CPU cores are scheduled and more watts are consumed).
I'm wondering about the reasons for `1.`, `2.`, `3.`, and `4.`, and am not sure whether this is a bug in PyTorch or a mistake in my experiments or in my reasoning.
### Versions
PyTorch version: 1.12.0.dev20220518
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 12.3.1 (arm64)
GCC version: Could not collect
Clang version: 13.1.6 (clang-1316.0.21.2.5)
CMake version: Could not collect
Libc version: N/A
Python version: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:14) [Clang 12.0.1 ] (64-bit runtime)
Python platform: macOS-12.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.12.0.dev20220518
[pip3] torchlibrosa==0.0.9
[pip3] torchvision==0.9.0a0
[conda] numpy 1.21.6 py39h690d673_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] pytorch 1.12.0.dev20220518 py3.9_0 pytorch-nightly
[conda] torchlibrosa 0.0.9 pypi_0 pypi
[conda] torchvision 0.9.1 py39h0a40b5a_0_cpu https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
I’ve run the same example on my vanilla M1 using a nightly and get 100% CPU use, consistent with just 1 AMX, vs 2 for the OP.
I did a cursory look through PRs and didn’t see Accelerate explicitly mentioned…
I forgot to hyperlink the following comment on this thread. The AMX needs to use all 8 power cores for full utilization.
Is AMX already used in PyTorch? I saw you posted a question asking that, and also this comment on GitHub.
I would regard that comment with skepticism. Regardless, if PyTorch uses Accelerate, then it uses the AMX. @albanD said that PyTorch should use Accelerate by default, although the documentation does not officially confirm this. For example, it seems to say that other math libraries will be chosen instead of Accelerate if they already exist on the system, because the…
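One local check is to inspect the build configuration string that PyTorch exposes, which records the BLAS backend chosen at compile time. This is a sketch; the exact wording of the `BLAS_INFO` line varies across builds, and the `try`/`except` is only there so the snippet degrades gracefully where PyTorch isn't installed.

```python
# Inspect the BLAS backend a PyTorch build was compiled against.
# torch.__config__.show() returns the CMake-time build summary,
# which typically includes a BLAS_INFO= line (e.g. "accelerate",
# "open" for OpenBLAS, or "mkl").
try:
    import torch
    config = torch.__config__.show()
    uses_accelerate = "accelerate" in config.lower()
    print("Accelerate in build config:", uses_accelerate)
except ImportError:
    # PyTorch not installed in this environment
    config = ""
```

If the build summary names Accelerate, then by the reasoning above the AMX should be in play for CPU GEMMs; if it names OpenBLAS or MKL, it would not be.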