From issue #47702 on the PyTorch repository, it is not yet clear whether PyTorch already uses AMX on Apple silicon to accelerate computations. It might do this because it relies on the operating system’s BLAS library, which is Accelerate on macOS. For reasons not described here, Apple has released little documentation on the AMX ever since its debut in the A13 chip.
If PyTorch does already use AMX, then that is ~1.3 TFLOPS of processing power. For comparison, the M1 GPU has 2.6 TFLOPS. The issue linked above was raised partially because PyTorch lacked hardware acceleration on Apple devices for a very long time. If AMX is in fact used and has comparable performance to GPU acceleration, then many people might want to know.
Could anyone investigate whether the AMX is being used? You may need to learn a bit of Swift, which provides direct access to Accelerate and microsecond-level precision for profiling. Note that M1 has one AMX, while M1 Pro/Max has two. Here are some helpful links for anyone who wishes to investigate this:
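As a starting point (without needing Swift), a rough way to probe this is to measure the achieved GEMM throughput of whatever BLAS your Python stack links against and compare it with the ~1.3 TFLOPS figure quoted for the AMX. This is only a sketch: it assumes your NumPy build routes `@` through the system BLAS (Accelerate on many macOS builds), and the function name `gemm_gflops` is just for illustration.

```python
import time
import numpy as np

def gemm_gflops(n=1024, iters=5, dtype=np.float32):
    """Estimate achieved GFLOP/s for an n x n matmul.

    A dense n x n matmul costs about 2*n**3 floating-point
    operations, so the measured rate hints at whether the BLAS
    backend is reaching AMX-class throughput (~1300 GFLOP/s)
    or only ordinary NEON-core throughput.
    """
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    a @ b  # warm-up call, excluded from the timed loop
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    dt = time.perf_counter() - t0
    return 2.0 * n**3 * iters / dt / 1e9

print(f"{gemm_gflops():.1f} GFLOP/s")
```

If the number lands in the hundreds of GFLOP/s on a single process with low reported CPU utilization, that pattern is at least consistent with the AMX doing the work rather than the regular cores.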
This issue suggests that AMX is used as of the MPS nightlies:
opened 04:07AM - 20 May 22 UTC
closed 01:47PM - 20 May 22 UTC
### 🐛 Describe the bug
After updating to the nightly-build PyTorch 1.12, I ran a performance test comparing `'mps'` against `'cpu'`, shown below:
```python
import torch
from tqdm import trange

DTYPE = torch.float32
MAT_SIZE = 5000
DEVICE = ["cpu", "mps"][0]  # it's CPU now

mat = torch.randn([MAT_SIZE, MAT_SIZE], dtype=DTYPE, device=DEVICE)
for i in trange(N_ITER := 100):
    mat @= mat  # <--- Main Computation HERE
print(mat[0, 0], end="")  # avoid sync-issue when using 'mps'
```
It's true that `"mps"` is indeed faster than `"cpu"` on this M1 Pro chip.
**However**, I soon noticed that it's not utilizing all 10 CPU cores when `device="cpu"`.
Specifically, `Activity Monitor.app` shows that it only uses ≈200% of CPU.
After further experiments, I found some interesting facts:
1. As mentioned above, `device="cpu"` on version 1.12 does not use all CPU cores on the M1 Pro chip.
2. When switching back to version 1.11, `device="cpu"` **does** take advantage of all the CPU cores.
3. Although `2.` is true, 1.11 is actually slower than 1.12! i.e. `device="cpu"` on 1.12 uses less CPU and less power yet gets better performance.
4. Although `1.` is true, manually running N (such as 2) instances of this script drops the performance of each script to 1/N of its original (while indeed more CPU cores are scheduled and more watts are consumed).
I'm wondering about the reasons for `1.`, `2.`, `3.`, and `4.`, and am not sure whether this is a bug in PyTorch or a mistake in my experiments or in my reasoning.
### Versions
PyTorch version: 1.12.0.dev20220518
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 12.3.1 (arm64)
GCC version: Could not collect
Clang version: 13.1.6 (clang-1316.0.21.2.5)
CMake version: Could not collect
Libc version: N/A
Python version: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:14) [Clang 12.0.1 ] (64-bit runtime)
Python platform: macOS-12.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.12.0.dev20220518
[pip3] torchlibrosa==0.0.9
[pip3] torchvision==0.9.0a0
[conda] numpy 1.21.6 py39h690d673_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] pytorch 1.12.0.dev20220518 py3.9_0 pytorch-nightly
[conda] torchlibrosa 0.0.9 pypi_0 pypi
[conda] torchvision 0.9.1 py39h0a40b5a_0_cpu https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
I’ve run the same example on my vanilla M1 using a nightly and get 100% CPU use, consistent with just 1 AMX, vs 2 for the OP.
I did a cursory look through PRs and didn’t see Accelerate explicitly mentioned…
I forgot to hyperlink the following comment on this thread. The AMX needs to use all 8 power cores for full utilization.
Is AMX already used in PyTorch? I saw you posted a question asking that, and also this comment on GitHub.
I would regard that comment with skepticism. Regardless, if PyTorch uses Accelerate, then it uses the AMX. @albanD said that PyTorch should use Accelerate by default, although the documentation does not officially confirm this. For example, it seems to say that other math libraries will be chosen instead of Accelerate if they already exist on the system, because the…
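One local check is to inspect the build configuration string that PyTorch exposes, which records the BLAS backend chosen at compile time. This is a sketch; the exact wording of the `BLAS_INFO` line varies across builds, and the `try`/`except` is only there so the snippet degrades gracefully where PyTorch isn't installed.

```python
# Inspect the BLAS backend a PyTorch build was compiled against.
# torch.__config__.show() returns the CMake-time build summary,
# which typically includes a BLAS_INFO= line (e.g. "accelerate",
# "open" for OpenBLAS, or "mkl").
try:
    import torch
    config = torch.__config__.show()
    uses_accelerate = "accelerate" in config.lower()
    print("Accelerate in build config:", uses_accelerate)
except ImportError:
    # PyTorch not installed in this environment
    config = ""
```

If the build summary names Accelerate, then by the reasoning above the AMX should be in play for CPU GEMMs; if it names OpenBLAS or MKL, it would not be.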