Apple Neural Engine (ANE) instead of / in addition to the GPU on M1, M2 chips

According to the docs, the MPS backend uses the GPU on M1/M2 chips via Metal compute shaders:

The mps device enables high-performance training on the GPU for macOS devices with the Metal programming framework. It introduces a new device to map machine learning computational graphs and primitives onto the highly efficient Metal Performance Shaders Graph framework and the tuned kernels provided by the Metal Performance Shaders framework, respectively.

The new MPS backend extends the PyTorch ecosystem and gives existing scripts the ability to set up and run operations on the GPU.
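For context, this is roughly what opting into the MPS backend looks like in an existing script (the `mps` device name and the `torch.backends.mps.is_available()` check are the standard PyTorch 1.12+ API; the toy model is just for illustration):

```python
import torch

# Pick the MPS (Metal) device if this PyTorch build and machine support it.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = torch.nn.Linear(128, 10).to(device)  # toy model for illustration
x = torch.randn(32, 128, device=device)
out = model(x)                               # executes on the GPU via Metal
print(out.device)
```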

According to the following repository, ml-macos-performance, inference on the ANE is roughly 7x faster than on the GPU:

densenet121_keras_applications (latency in seconds, requests per second):

- ANE: latency 0.001274 s, 784.7 RPS
- GPU: latency 0.008271 s, 120.9 RPS
- CPU: latency 0.015348 s, 65.2 RPS

I was wondering whether training performance would be better if we could also use Apple's Neural Engine.

There are obviously some restrictions (see the unsupported Neural Engine layers), but that should be similar to the situation with Google's TPUs.
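For reference, the only sanctioned route to the ANE today is Core ML. A hedged sketch of that path (using densenet121 only to mirror the benchmark above; the exact flags assume a recent coremltools/torchvision, and whether a layer actually lands on the ANE is decided by the Core ML runtime, not by us):

```python
import coremltools as ct
import torch
import torchvision

# Trace a PyTorch model so coremltools can convert it.
model = torchvision.models.densenet121(weights=None).eval()
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

# compute_units only states what the model is *allowed* to run on;
# ComputeUnit.ALL permits CPU + GPU + Neural Engine.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example.shape)],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,
)
mlmodel.save("densenet121.mlpackage")
```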

It seems there may be too much protection/security around calling the ANE directly; judging by George Hotz's tinygrad, maybe that is the biggest blocker?


Hi,
thanks for the write-up; by the way, the tinygrad link gives a 404 :sweat_smile:

I have been thinking about applying FlashAttention for faster training locally on MacBooks, but it currently only supports CUDA, and MPS is less mature in terms of implementations, as far as I know.

The project is in the ideation stage, here.

I don’t have all the answers, of course, and this will be an open-source collaborative effort. I’m researching which missing pieces I need to look for.

The goal is clear: “make training faster on MacBooks with FlashAttention”, and that may need various pieces: MPS, PyTorch, the ANE, etc.
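To make the missing pieces concrete: FlashAttention's kernels are CUDA-only, but PyTorch already exposes `torch.nn.functional.scaled_dot_product_attention`, which dispatches on the mps device (how fused or fast it is there depends on the PyTorch version). A rough sketch of the baseline we would be trying to beat, with arbitrary example shapes:

```python
import torch
import torch.nn.functional as F

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# (batch, heads, sequence length, head dimension) — example shapes only.
q = torch.randn(8, 16, 1024, 64, device=device)
k = torch.randn(8, 16, 1024, 64, device=device)
v = torch.randn(8, 16, 1024, 64, device=device)

# Built-in attention entry point; not FlashAttention itself, but the
# closest thing that already runs on MPS today.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```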

I’d appreciate absolutely any help, comments, or input on this from the community.

It seems Apple’s new ML framework MLX also doesn’t support using the ANE for inference; maybe they are still trying to work out the API for Python and C applications? Also, is it possible to use the ANE for training?