Is there a way to know programmatically if mps is available or not? Something similar to torch.cuda.is_available() or torch.cuda.device_count().
Hey!
Yes, you can use torch.backends.mps.is_available()
to check that.
There is only ever one device though, so no equivalent to device_count in the python API.
This doc (MPS backend — PyTorch master documentation) will be updated with that detail shortly!
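For reference, a minimal sketch of how that check can be used, assuming a build recent enough to ship the MPS backend (1.12+ / nightly):

```python
# Minimal sketch: pick "mps" if available, otherwise fall back to "cpu".
import torch

print(torch.backends.mps.is_available())  # True when an MPS device can actually be used
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.ones(3, device=device)          # tensor lands on "mps" or falls back to "cpu"
```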
Hey, the announcement says:
To get started, just install the latest Preview (Nightly) build on your Apple silicon Mac running macOS 12.3 or later with a native version (arm64) of Python.
I followed the instructions and used pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
to install torch on my Mac with M1 Pro (macOS 12.4, Python 3.9 arm64). However, the installed PyTorch is still 1.11 and does not allow mps as a device.
How may I install the latest version via pip? Thanks in advance!
I tried uninstalling (pip3 uninstall torch torchvision torchaudio) and reinstalling PyTorch using the recommended command, and it worked.
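For anyone hitting the same issue, a quick sanity check after reinstalling (only standard torch attributes, nothing hypothetical):

```python
# Verify that the nightly wheel with MPS support is actually the one being imported.
import torch

print(torch.__version__)                  # a nightly should report a ".dev" version string
print(torch.backends.mps.is_built())      # True if this wheel was compiled with MPS support
print(torch.backends.mps.is_available())  # True on Apple silicon with macOS 12.3+
```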
Thanks a lot!! It worked. Didn’t know uninstall would make a difference.
Just found out that float64 is not supported.
Sad… I guess it is meant for machine learning, not really for scientific computing.
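For illustration, a small sketch of the limitation and the usual workaround (casting down to float32 before moving to mps):

```python
# float64 tensors cannot live on the MPS device, so cast to float32 first.
import torch

a = torch.rand(4, 4, dtype=torch.float64)      # typical scientific-computing default dtype
# a.to("mps") raises, because the MPS backend has no float64 support
a32 = a.to(dtype=torch.float32, device="mps")  # works, at reduced precision
```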
The AMX does have high-performance FP64 matrix multiplication on the CPU. And if you compare the ratio of CPU performance (FP64) to GPU performance (FP32), it’s actually better than on Nvidia GPUs.
- M1 Max, AMX FP64: 500 GFLOPS
- M1 Max, GPU FP32: 10,000 GFLOPS
Ratio: 20:1 in terms of FP32:FP64. In comparison, Nvidia GPUs have a 64:1 ratio and they come with x86 CPUs which only have AVX512 on the CPU at best. The fastest Intel CPU would have 8 ops/cycle * 4 GHz * 8 CPU cores = 256 GFLOPS FP64, total. That’s less than Apple.
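Just to make the arithmetic explicit, here is the same back-of-envelope calculation in Python (all numbers are the estimates quoted above, not measurements):

```python
# Back-of-envelope check of the quoted figures (poster's estimates, not benchmarks).
amx_fp64 = 500e9            # M1 Max AMX, FP64 FLOPS
gpu_fp32 = 10_000e9         # M1 Max GPU, FP32 FLOPS
print(gpu_fp32 / amx_fp64)  # 20.0 -> the quoted 20:1 FP32:FP64 ratio

print(8 * 4e9 * 8 / 1e9)    # 8 ops/cycle * 4 GHz * 8 cores = 256 GFLOPS FP64 (Intel estimate)
```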
Many thanks for the comment!
In my application case, torch.matmul and torch.linalg.solve are the most time-consuming parts, where I got a ~2.6x speed-up with M1 Pro vs. i7-11800H (and more vs. older Intel CPUs).
However, this is nowhere near the speed-up from recent Nvidia GPUs (~13.5x with a 130W laptop Nvidia RTX 3070 vs. i7-11800H, and more with e.g. an A100).
So I was hoping for a performance boost from the new release.
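For anyone who wants to reproduce a rough comparison, a minimal timing sketch for the matmul case might look like this (float32 only, since MPS has no float64; a single run includes warm-up overhead, so treat it as indicative):

```python
# Naive timing of a single matmul on CPU vs. MPS.
import time
import torch

n = 2048
a_cpu, b_cpu = torch.rand(n, n), torch.rand(n, n)

t0 = time.time()
c_cpu = a_cpu @ b_cpu
print("cpu:", time.time() - t0)

if torch.backends.mps.is_available():
    a, b = a_cpu.to("mps"), b_cpu.to("mps")
    t0 = time.time()
    c = (a @ b).cpu()   # copying back forces the GPU work to finish before timing stops
    print("mps:", time.time() - t0)
```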
Is AMX already used in PyTorch? I saw you posted a question asking that, and also this comment on GitHub.
Strangely enough, with my M1 Pro and an Arm version of Python, miniconda with the -c pytorch-nightly flag did not work. I had to use the pip install command to install the nightly version.
Is AMX already used in PyTorch? I saw you posted a question asking that, and also this comment on GitHub.
I would regard whoever posted that comment with skepticism. Regardless, if PyTorch uses Accelerate, then it uses the AMX. @albanD said that PyTorch should use Accelerate by default, although the documentation does not officially confirm this is true. For example, it seems to say that other math libraries will be chosen instead of Accelerate if they already exist on the system, because they’re “faster” (not).
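One way to check what a given wheel was actually linked against is to dump the build configuration and look at the BLAS/LAPACK lines (this uses the standard torch.__config__.show() API):

```python
# Print the compile-time configuration; look for "Accelerate", "OpenBLAS" or "MKL"
# in the BLAS/LAPACK entries to see which math library this build uses.
import torch
print(torch.__config__.show())
```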
In my application case, torch.matmul and torch.linalg.solve are the most time-consuming parts,
Regarding the linear algebra speedup, AMX exclusively does matrix multiplications. I don’t know what percentage of the linalg.solve algorithm is matrix multiplication and what percentage is matrix factorization. If it can’t utilize the AMX, it won’t run fast.
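If you want a rough feel for that split, a sketch like the following compares the factorisation step on its own against the full solve (assuming a build that has torch.linalg.lu_factor; CPU, float64, single run, so indicative at best):

```python
# Compare the cost of the LU factorisation alone with the cost of the full solve.
import time
import torch

n = 4096
A = torch.rand(n, n, dtype=torch.float64)
b = torch.rand(n, 1, dtype=torch.float64)

t0 = time.time()
LU, pivots = torch.linalg.lu_factor(A)   # factorisation step on its own
print("lu_factor:", time.time() - t0)

t0 = time.time()
x = torch.linalg.solve(A, b)             # full solve: factorisation + triangular solves
print("solve    :", time.time() - t0)
```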
Apple’s AMX should have the equivalent FP64 processing power to an Nvidia 3000-series GPU. The AMX is a massively powerful coprocessor that’s too big for any one CPU core to handle. That’s why each block of 4 power cores must simultaneously utilize its AMX block to squeeze out all the power. That’s also why the M1 Pro/Max, which has double the power cores of M1, has double the “AMX”. Note that regular M1 also has a second AMX block for its efficiency cores, but that has 1/3 the performance. If you want to know more, I have a long Twitter conversation with the guy who reverse-engineered the AMX and another engineer who hand-wrote assembly for Apple’s math libraries:
https://twitter.com/dougallj/status/1494643295946887169
Edit: The organization of that Twitter conversation seems to be a mess. Just look under my profile for a series of around 20 tweets with two different people in the same time frame, all talking performance gibberish.
We came up with numbers like 256 GFLOPS (FP32) per power CPU core with a 4:1 ratio of FP32:FP64. So first, PyTorch has to be multithreaded and use all the power CPU cores to reach 2000 GFLOPS FP32/500 GFLOPS FP64. Regarding Nvidia GPUs, the 3090 has 556 GFLOPS of double-precision power. That’s comparable to the AMX’s estimated 500 GFLOPS, but it’s also open to general-purpose compute such as sine and cosine functions. Furthermore, the ratio of memory bandwidth to ALU power is more conducive to high ALU utilization with the Nvidia GPU’s FP64.
Although if you really want fast FP64, then you should use a newer discrete AMD GPU. They have a 16:1 FP32:FP64 ratio with ~1.2 TFLOPS of FP64 power. The unofficial DLPrimitives backend for PyTorch would support AMD GPU acceleration, but I don’t think it supports FP64 yet. You could work with the owner to incorporate FP64 into basic GEMM, ensuring that the feature is disabled on Apple GPUs. Since the macOS OpenCL API delegates to the Metal compiler (at least on M1), you might be restricted to using your GPU on Linux and Windows.
Even crazier, look out for Intel’s Arc Alchemist GPU. Intel has a 4:1 ratio of FP32:FP64 performance. If FP32 performance is 20 TFLOPS, you could expect 5 TFLOPS of FP64 processing power. On one further note, the M1 Ultra should have double the AMX power of the M1 Max, reaching 1 TFLOPS FP64.
Can the API be torch.mps.is_available()?
IMO, it looks a bit weird to have device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'.
The two side by side just feel inconsistent.
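In the meantime, a tiny helper can hide the inconsistency (pick_device is just a made-up name for illustration):

```python
# Wrap the differently-named availability checks behind one function.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    if getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
```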
Hi!
I do agree with that, but I think we prefer to move the (numerous) new backends into the torch.backends namespace.
Note that both the cuda and mps backends already have is_built() there.
For consistency, I think we should add an is_available() function to torch.backends.cuda that just does the same thing as the torch.cuda one. If you have some time to send a PR for that, you can add me as a reviewer.
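For what it’s worth, such a function could be as small as this (a hypothetical sketch, not the actual PR):

```python
# Hypothetical torch.backends.cuda.is_available(); it would simply delegate.
import torch

def is_available() -> bool:
    """Return whether CUDA is available, mirroring torch.backends.mps.is_available()."""
    return torch.cuda.is_available()
```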
May I ask what the “recommended command” you mentioned here is? My PyTorch also does not work properly on mps.
I was referring to the announcement post:
To get started, just install the latest Preview (Nightly) build on your Apple silicon Mac running macOS 12.3 or later with a native version (arm64) of Python.
which is basically
pip3 install torch torchvision torchaudio
Thanks so much! It turns out that my macOS version was not new enough to enable mps as a training device. After updating the system, it works for me!
Is there a way to do this without a crash in previous PyTorch versions? E.g. 1.8.2, where there is no torch.backends.mps?
You can do:
mps_available = hasattr(torch.backends, "mps") and torch.backends.mps.is_available()
JIT does not like it, unfortunately.
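One workaround that keeps the JIT happy is to resolve the device eagerly in plain Python, outside anything that gets scripted, and only pass concrete devices/tensors into scripted code; a hedged sketch:

```python
# Do the backend introspection once at module level, not inside scripted functions.
import torch

MPS_AVAILABLE = hasattr(torch.backends, "mps") and torch.backends.mps.is_available()
DEVICE = torch.device("mps" if MPS_AVAILABLE else "cpu")

@torch.jit.script
def scale(x: torch.Tensor) -> torch.Tensor:
    # no backend introspection in here, so TorchScript is happy
    return x * 2.0

y = scale(torch.ones(3, device=DEVICE))
```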
“There is only ever one device though, so no equivalent to device_count in python API.”
This is not true.
Some Mac Pro models have two GPUs and are still widely in use. In fact, I am using one now. If Python/PyTorch correctly implements the Metal API, GPU selection and/or using both GPUs by default is very achievable. Some of the apps I use that correctly use Metal have all (both) GPUs enabled by default, such as Final Cut Pro, iMovie, Blender, Affinity Photo, Pixelmator Pro, etc. Any application can use all GPUs via Metal. Python/PyTorch can too.
This would be a great (and relatively simple) feature to add, as using a very cheap (~$300 on the secondary market) Mac Pro with dual GPUs is a great way to develop and test data parallelism locally before deploying it to larger systems.