Essentially identical training time on CPU/MPS. Unconvinced I'm using MPS at all

I realize there is ongoing discussion about whether MPS offers much of a speedup, or whether it is even slower than the CPU in some circumstances. My question isn’t about that. My concern is that I get nearly identical training times (and nearly identical loss/accuracy curves) on CPU and MPS. And yet, when I print the model device and the batch device, I correctly get CPU for a CPU run and MPS for an MPS run.
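
For reference, here is a minimal sketch of a sanity check that goes beyond printing the device. It times the same matrix multiply on each device and synchronizes before reading the clock, since MPS work is queued asynchronously. It assumes PyTorch 2.0 or later (for `torch.mps.synchronize`). The helper names `time_steps` and `compare_devices`, and the matrix size, are made up for illustration and are not taken from my training script.

```python
import time


def time_steps(step, sync=None, warmup=3, iters=20):
    """Return mean seconds per call of step().

    sync, if given, is called before the clock is started and again before
    it is stopped, so that work still queued on a device is included.
    """
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    if sync is not None:
        sync()  # wait for queued device work to finish
    return (time.perf_counter() - start) / iters


def compare_devices(size=2048):
    """Time a size x size matmul on CPU and, if available, on MPS."""
    # Imported here so the timing helper above works without PyTorch installed.
    import torch

    print("MPS built:", torch.backends.mps.is_built())
    print("MPS available:", torch.backends.mps.is_available())

    devices = ["cpu"] + (["mps"] if torch.backends.mps.is_available() else [])
    results = {}
    for name in devices:
        device = torch.device(name)
        a = torch.randn(size, size, device=device)
        b = torch.randn(size, size, device=device)
        # Assumption: torch.mps.synchronize exists (PyTorch >= 2.0).
        sync = torch.mps.synchronize if name == "mps" else None
        results[name] = time_steps(lambda: a @ b, sync=sync)
        print(f"{name}: {results[name] * 1e3:.2f} ms per matmul")
    return results
```

Calling `compare_devices()` prints one timing per device. If the two numbers are clearly different there, the MPS path is at least being exercised for that operation, and the identical training times would then more likely come from somewhere else in the training loop.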

I’m currently running on a 2020 M1 Mac Mini, which admittedly has only 8 GPU cores to compete with its 8 CPU cores. I will soon have access to an M1 Max Studio with 10 CPU and 24 GPU cores, and perhaps I will get different results on that machine. For the time being, though, the result on the M1 Mac Mini is strange: the GPU doesn’t merely underperform the CPU, it trains almost identically to the CPU, to the point that I’m not confident I’m using the GPU at all, even though printing the device of the model and data confirms that both have been moved to MPS.

Any thoughts?