Super slow training in M5 Mac compared to 4050 Nvidia GPU, why?

rafad900 · January 8, 2026, 8:18pm

Hello PyTorch,

So I built a simple training loop in C++ for a custom CNN that I’m working on and everything seems to be working fine. I’m training it on a windows laptop with a 4050 Nvidia GPU and have no problems. After running the code a couple times, I know how long its going to take…
However, I tried running the same program on my M5 Mac with 16 GB mem and found that it runs slower… CONSIDERABLY slower…
I know that the CPU/GPU architecture is different between x86 and ARM but still I find it hard to attribute the slow down to only that. Especially since Mac has the unified memory between the CPU/GPU. I honestly expected it to be faster than on my windows machine.

I haven’t been able to find any kind of info related directly to my question and ChatGPT doesn’t know either lol

I’m using the C++ frontend of libtorch and have read through the documentation and source code a bit but still not able to understand what is making it so slow. Does anyone have any idea on this?

github.com/rafad900/cpp_neural_network_training_pipeline

libs/ModelTrainer.cpp

3373dabb1


      
          void ModelTrainer::train_model() {
              if (model_ == nullptr) {
                  std::cerr << "NO MODEL SET UP!\n";
                  return;
              }
              if (!loaded_pretrained_weights && !pretrained_path.empty() && !pretrained_file.empty()) {
                  std::cout << "Loaded the pretrained weights\n";
                  loaded_pretrained_weights = true;
                  fs::path full_path = pretrained_path;
                  fs::path filename = pretrained_file;
                  full_path = full_path / filename;
                  torch::load(model_, full_path.string() );
              }
          
              model_->train();
              model_->to(device_);
              torch::optim::Adam optimizer = torch::optim::Adam(model_->parameters(), torch::optim::AdamOptions(learning_rate));
          
              CIFAR10 all_data = CIFAR10("/home/rafad900/Data/cifar-10-batches-bin/train");
              auto train_dataset  = CustomDataset(all_data.get_images(), all_data.get_labels()).map(torch::data::transforms::Stack<>());

This file has been truncated. show original

SchulzKilian · January 9, 2026, 2:59pm

it’s definitely going to be slower just form the CUDA optimizations, nvidia is just hard to beat there.

But check out your GPU usage, it should not be that much slower, it sounds like it might be falling back on GPU.

rafad900 · January 11, 2026, 4:58am

I realized I was missing some configuration within my CMakeLists.txt to that would enable the linking of Apple Metal Shader libraries with my executable.

if(APPLE)
    message(STATUS "Configuring Utils with Metal/MPS support for macOS")
    find_library(METAL_FRAMEWORK Metal REQUIRED)
    find_library(FOUNDATION_FRAMEWORK Foundation REQUIRED)
    find_library(QUARTZ_CORE_FRAMEWORK QuartzCore REQUIRED)
    find_library(MPS_FRAMEWORK MetalPerformanceShaders REQUIRED)
    find_library(MPS_GRAPH_FRAMEWORK MetalPerformanceShadersGraph REQUIRED)


    target_link_libraries(<EXEC_NAME> PUBLIC 
        ${METAL_FRAMEWORK}
        ${FOUNDATION_FRAMEWORK}
        ${QUARTZ_CORE_FRAMEWORK}
        ${MPS_FRAMEWORK}
        ${MPS_GRAPH_FRAMEWORK}
    )   


    enable_language(OBJCXX)
endif()

This does speed things up a bit, but even still slower than an Nvidia GPU. Is there an effort to optimize for Apple Metal?

iKnowMYABGZ · January 12, 2026, 10:03pm

It’s something I myself thought would not be such a hard task to do also, until I actually took the two systems apart virtually to see the easy answer that even apple tech 1 support might get right lol. The three major reasons are:

Lack of CUDA / DirectML support
Smaller memory footprint
Integration overhead in Metal or translation layers

You are comparing two completely dif style rigs, and the one big reason also answer to your thought/question is simply you are running a windows rig with a standalone 4050 graphics card. Which doesnt need any component for help or to be wired into something to work. Apple’s M5 guts work like this, the GPU is shared along with the CPU and neural engine, limiting peak throughput for highly parallel workloads like modern AI training or ray-tracing-heavy 3D rendering. Benchmarks like Blender, AI model training, other 3D rendering tools have shown anything GPU intensive sometimes dwarfs M series Macs when the software is not fully optimized for Metal. Conversely, tasks like native video encoding, Lightroom exports, or GPU-accelerated Apple apps may perform as well or better on the M5