This thread is for carrying on any discussion from:
It seems that Apple is choosing to leave Intel GPUs out of the PyTorch backend, when they could theoretically support them. For reference, on the other thread, I pointed out that Apple did the same thing with their TensorFlow backend. When it was released, I only owned an Intel Mac mini and could not run GPU-accelerated TF. Other people may feel the same way, even though M1 is more common now.
Their earliest (now archived) TF 2.4 backend, built on MLCompute, crashed at runtime on the Mac mini after allocating 40 GB of virtual memory. The second backend officially dropped support for Intel GPUs, which are still a large part of their consumer base.
Sorry for the inaccurate answer on the previous post.
After some more digging, you are absolutely right that this is supported in theory.
The reason we disable it is that, during our experiments, we observed these GPUs are not very powerful; for most users the CPU will actually be faster.
So while many users do have these GPUs, most of them should not use them for ML workloads.
I don’t plan on compiling PyTorch myself, as that isn’t my primary ML project, but I will inject my opinion here. I think it’s a bad idea to prevent the user from accessing something. Most people won’t have the patience or experience to compile PyTorch from source and use the compiled build products ergonomically. As someone who makes software for end users, I believe it should be up to the user to decide. Especially if someone happens to run a CPU-intensive process alongside their ML process, where the GPU would be the only part of the chip left free for computation. This would also make your PyTorch backend stand out from the TF backend.
I think it would be best to enable support from the start, then disable it if there’s a strong signal from users to do so. I recommend putting a warning in the PyTorch docs saying “this may be slow on Intel GPUs”. Or at the very least, put a large notice telling Intel Mac users how to compile PyTorch from source if they want to test an Intel Mac GPU.
Edit: It would also be weird to have a script on macOS that profiles the GPU or uses it in some way, only to have the framework disable acceleration when you switch between an Apple silicon and an Intel Mac. Maybe you could provide a hidden or documented option to re-enable execution on the Intel device through the Python API. It should be extremely simple to add that feature to PyTorch - just a conditional statement surrounding your cited Objective-C code. Although I’m not going to make a PR to do so myself.
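To illustrate the idea, here is a rough sketch of what such an opt-in could look like. Everything here is hypothetical: the environment-variable name `PYTORCH_MPS_ALLOW_INTEL_GPU` and the helper function are my own invention, not part of PyTorch.

```python
import os
import warnings

# Hypothetical sketch of an opt-in gate -- not actual PyTorch code.
# Default behavior stays the same (Intel GPUs disabled), but users can
# re-enable the device without recompiling from source.

def mps_device_is_allowed(is_apple_silicon: bool) -> bool:
    """Return True if the detected GPU may be used by the MPS backend."""
    if is_apple_silicon:
        return True
    # Escape hatch for Intel Macs: opt in explicitly, with a warning.
    if os.environ.get("PYTORCH_MPS_ALLOW_INTEL_GPU", "0") == "1":
        warnings.warn(
            "MPS on Intel GPUs may be slower than the CPU; "
            "benchmark before relying on it."
        )
        return True
    return False
```

The native check would live in the backend’s Objective-C device-selection code rather than Python, but the logic is the same single conditional either way.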
I concur with @philipturner. This should be built into the library itself. PyTorch isn’t an end-user product. It should allow its developers to do what they want. Especially when this situation could easily be controlled by a simple boolean check. Recompiling the library seems like overkill for this purpose.
One big reason why I’m dead set on using Intel GPUs is my personal project, the revival of Swift for TensorFlow (S4TF). This is another ML framework like PyTorch, but different in that it could theoretically run on iOS and could take drastically less time to compile. There are going to be two compile options. One is the old version, which uses the TensorFlow code base as a backend and is CPU-only on macOS. The other uses a small custom code base, is GPU-only, and runs on iOS and macOS, among other platforms. The code base can be small because system libraries (MPS and MPSGraph) contain the kernels and graph compiler. Or, in the case of OpenCL, the kernel library is DLPrimitives, which is tiny.
For something that’s GPU-only, it will be mandatory to use the Intel GPU on certain Macs. The upper limit of ALU utilization for matrix multiplication is around 90% on Intel GPUs. That translates to ~350 GFLOPS for the Intel UHD 630. Compare that to the CPU, which is on the order of tens of GFLOPS. In theory, if all other bottlenecks are eliminated, most models would run faster on the Intel GPU than the CPU.
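As a sanity check, here is the back-of-the-envelope arithmetic behind those numbers. The peak and CPU figures are the rough values assumed above, not measurements:

```python
# Rough figures from the discussion above -- assumptions, not benchmarks.
peak_gflops = 400.0        # theoretical FP32 peak of the Intel UHD 630
max_utilization = 0.90     # upper limit of ALU utilization for GEMM
sustained_gflops = peak_gflops * max_utilization  # ~360, i.e. roughly 350

cpu_gflops = 50.0          # "tens of GFLOPS" order of magnitude for the CPU
speedup = sustained_gflops / cpu_gflops
print(f"~{sustained_gflops:.0f} GFLOPS on the iGPU, ~{speedup:.0f}x over CPU")
```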
The big “if” is whether those bottlenecks can be eliminated. I hypothesize that CPU overhead, or model configurations that underutilize the GPU, are why it runs slowly on PyTorch. For S4TF, I have quite extensive plans to reduce CPU overhead, leaving the only problem as models that underutilize the GPU - for example, oddly shaped matrix multiplications, or convolutions that can’t use Winograd. Potentially, the entire Intel GPU architecture is terrible at ML, even the 10-TFLOPS Arc Alchemist. But that conclusion contradicts the fact that Intel invested money and time making matrix multiplication kernels for Intel GPUs in oneDNN.
We will have to wait and see why Intel GPUs are so slow for training - whether because of PyTorch’s design, or some other fundamental problem that can’t be solved in an S4TF backend. Even if it is slower, I will definitely give the user the choice of CPU or GPU on Macs with only an Intel GPU.
@albanD I’m curious about how bad the Intel GPU was during internal benchmarks. Before getting into this, I have a few questions:
Did you test only the 400-GFLOPS UHD 630, or also the 800-GFLOPS Iris Plus? The second processor has 35% of the FLOPS of a 7-core M1, with relatively similar ALU utilization during matrix multiplications. It should also have identical main memory bandwidth.
Did you try using shared memory on Intel iGPUs, which would bring performance closer to Apple iGPUs?
When you said the Intel iGPU was slower, was that relative to single-core or multi-core CPU?
Let’s say that someone can only use operators available to MPS. They can’t process double-precision numbers either. They run every single operation on the GPU. Based on your benchmarks, what is the performance delta of ____ compared to single-core CPU?
- Apple integrated GPU
- Intel integrated GPU
Intel Macs don’t have AMX, so CPU matrix multiplications are considerably slower. If you could provide both average and worst-case metrics, that would be just what I’m looking for.
I’m asking this because the ML backend I’m developing is GPU-only. Removing CPU operations makes my code base smaller and more maintainable. In an era where exponential growth in processing power comes from greater parallelization, single-core CPU is becoming increasingly obsolete. That is why I’m pursuing the intense latency optimizations described in “Sequential throughput of GPU execution”. I have to make ML operators run as fast as possible on an Intel iGPU, because I cannot run them on the CPU.
I would argue that this is problematic because PyTorch is an end-user product. Most users don’t have the Git or command-line experience to compile PyTorch. They might not even know that Objective-C exists; Python may be their first programming language. Are we telling them that because of their lack of experience, they don’t have the right to test their iGPU for machine learning? Even if it is slower, they lack access to the tools needed to prove it is slower and to reproduce that proof themselves. These are concepts we take for granted in the field of science, where reproducibility is mandatory.
This is something Apple benefits from, because the only other options are either (1) upgrade to an M1 Mac or (2) switch to a PC and get a cheaper Nvidia GPU with tensor cores. Now what if they are a teenager who can’t muster the hundreds to over a thousand dollars needed to upgrade their hardware, because their parents aren’t giving them that stuff for free? I have been in this exact position before. I had a powerful Apple GPU, and made a whole research paper centered on it. But the M1-family GPU was in my iPhone, not my Mac.