One big reason why I’m dead set on using Intel GPUs is my personal project, the revival of Swift for TensorFlow (S4TF). This is another ML framework like PyTorch, but different in that could theoretically run on iOS and could take drastically less time to compile. There’s going to be two possible compile options. One is the old version, which uses the TensorFlow code base as a backend and is CPU-only on macOS. The other option uses a small custom code base, is GPU-only, and runs on iOS and macOS, among other platforms. The code base can be small because system libraries (MPS and MPSGraph) contain the kernels and graph compiler. Or, in the case of OpenCL, the kernel library is DLPrimitives, which is tiny.
For something that’s GPU-only, it will be mandatory to use the Intel GPU on certain Macs. The maximum limit of ALU utilization for matrix multiplications is around 90% on Intel GPUs. This means ~350 GFLOPS of power for the Intel UHD 630. Compare that to the CPU, which is on the order of 10’s of GFLOPS. In theory, if all other bottlenecks are eliminated, most models would run faster on the Intel GPU than the CPU.
The big “if” is whether bottlenecks are eliminated. I hypothesize that CPU overhead or model configurations that underutilize the GPU are why it runs slow on PyTorch. For S4TF, I have quite extensive plans to reduce CPU overhead, leaving the only problem being models that underutilize the GPU. For example, oddly shaped matrix multiplications or convolutions that can’t use Winograd. Potentially, the entire Intel GPU architecture is terrible at ML, even the 10 TFLOPS Arc Alchemist. But that conclusion contradicts the fact that Intel invested money and time making MMX kernels for Intel GPUs in oneDNN.
We will have to wait and see why the Intel GPUs are being so slow for training, whether because of PyTorch’s design or some other fundamental problem that can’t be solved in an S4TF backend. Even if it is slower, I will definitely give the user the choice of using the CPU or GPU on Macs with only an Intel GPU.