> Is AMX already used in pytorch?

I saw you posted a question asking that and also this comment on GitHub.
I would regard whoever posted that comment with skepticism. Regardless, if PyTorch uses Accelerate, then it uses the AMX. @albanD said that PyTorch should use Accelerate by default, although the documentation does not officially confirm this. For example, the documentation seems to say that other math libraries will be chosen instead of Accelerate if they already exist on the system, because they’re “faster” (they’re not).
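One way to check this on your own install is to read the build configuration string that `torch.__config__.show()` returns. This is only a sketch: the helper name `blas_info` is made up for illustration, and it assumes the build string contains a `BLAS_INFO=...` entry, which may not hold for every build.

```python
import re


def blas_info(config_text):
    """Return the BLAS library named in a build-config string, or None.

    Assumption: the string contains an entry like 'BLAS_INFO=accelerate'.
    """
    match = re.search(r"BLAS_INFO=(\w+)", config_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        # torch.__config__.show() returns the build settings as one string.
        print("BLAS backend:", blas_info(torch.__config__.show()))
```

If this reports `accelerate`, matrix multiplications should be going through Apple's library. Any other value means a different BLAS was picked up at build time.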
> In my application case, torch.matmul and torch.linalg.solve are the most time-consuming part,

Regarding the linear algebra speedup, AMX exclusively does matrix multiplications. I don’t know what percentage of the linalg.solve algorithm is matrix multiplication and what percentage is matrix factorization. Whatever part can’t utilize the AMX won’t run fast.
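Since I can't tell you the split from theory, you could just measure it. Below is a rough sketch that times an FP64 `torch.matmul` against `torch.linalg.solve` on same-sized inputs. The helper `best_time`, the matrix size, and the repeat count are arbitrary choices for illustration, and wall-clock timings like this are noisy.

```python
import time


def best_time(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over several calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; nothing to measure")
    else:
        n = 1024  # arbitrary size, chosen only for illustration
        # Adding n * I keeps the random matrix well conditioned for the solve.
        a = torch.rand(n, n, dtype=torch.float64) + n * torch.eye(n, dtype=torch.float64)
        b = torch.rand(n, n, dtype=torch.float64)
        t_mm = best_time(lambda: torch.matmul(a, b))
        t_solve = best_time(lambda: torch.linalg.solve(a, b))
        print(f"matmul: {t_mm:.4f} s, linalg.solve: {t_solve:.4f} s")
```

If the solve is much slower than the multiply at the same size, a large share of its time is probably going to work that the AMX isn't accelerating.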
Apple’s AMX should have FP64 processing power equivalent to that of an Nvidia 3000-series GPU. The AMX is a massively powerful coprocessor that’s too big for any one CPU core to handle. That’s why each block of 4 power cores must simultaneously utilize its AMX block to squeeze out all the power. That’s also why the M1 Pro/Max, which has double the power cores of M1, has double the “AMX”. Note that regular M1 also has a second AMX block for its efficiency cores, but that has 1/3 the performance. If you want to know more, I had a long Twitter conversation with the guy who reverse-engineered the AMX and another engineer who hand-wrote assembly for Apple’s math libraries:
https://twitter.com/dougallj/status/1494643295946887169
Edit: The organization of that Twitter conversation seems to be a mess. Just look under my profile for a series of around 20 tweets with two different people in the same time frame, all talking performance gibberish.
We came up with numbers like 256 GFLOPS (FP32) per power CPU core with a 4:1 ratio of FP32:FP64. So first, PyTorch has to be multithreaded and use all the power CPU cores to reach roughly 2000 GFLOPS FP32/500 GFLOPS FP64. Regarding Nvidia GPUs, the 3090 has 556 GFLOPS of double-precision power. That’s comparable to the AMX’s estimated 500 GFLOPS, but the GPU is also open to general-purpose compute such as sine and cosine functions. Furthermore, the Nvidia GPU’s ratio of memory bandwidth to ALU power is more conducive to high FP64 ALU utilization.
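The arithmetic behind those estimates, as a quick sketch. The function name `peak_gflops` is just for illustration, and the count of 8 power cores is an assumption that matches the ~2000 GFLOPS figure (an M1 Pro/Max with all 8 power cores). These are theoretical peaks, not measured throughput.

```python
def peak_gflops(cores, fp32_per_core, fp32_to_fp64_ratio):
    """Return (FP32, FP64) theoretical peak GFLOPS for a group of cores."""
    fp32 = cores * fp32_per_core
    fp64 = fp32 / fp32_to_fp64_ratio
    return fp32, fp64


# Assumed: 8 power cores, 256 GFLOPS FP32 per core, 4:1 FP32:FP64 ratio.
fp32, fp64 = peak_gflops(cores=8, fp32_per_core=256, fp32_to_fp64_ratio=4)
print(fp32, fp64)  # 2048 512.0
```

So the "2000/500" figures are 2048 and 512 rounded down, and you only get there if every power core is feeding its AMX block at once.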
Although if you really want fast FP64, then you should use a newer discrete AMD GPU. They have a 16:1 FP32:FP64 ratio with ~1.2 TFLOPS of FP64 power. The unofficial DLPrimitives backend for PyTorch would support AMD GPU acceleration, but I don’t think it supports FP64 yet. You could work with the owner to incorporate FP64 into basic GEMM, ensuring that the feature is disabled on Apple GPUs. Since the macOS OpenCL API delegates to the Metal compiler (at least on M1), you might be restricted to using your GPU on Linux and Windows.
Even crazier, look out for Intel’s Arc Alchemist GPU. Intel has a 4:1 ratio of FP32:FP64 performance. If FP32 performance is 20 TFLOPS, you could expect 5 TFLOPS of FP64 processing power. On one further note, the M1 Ultra should have double the AMX power of the M1 Max, reaching 1 TFLOPS FP64.