I have written unit tests and integration tests that show the results are equal to Python's, but once I run this on the iPhone the results are different.
arm64 and amd64 use different backends, so it is quite possible that you found a bug in the arm64 one, particularly if you use less common modules. (For example, I ran into this with transposed convolutions a year ago on arm32: a network ran fine on amd64, but the output was messed up on my phone.)
I know it is a lot of work, but if you want, the ideal reproducing case would be to narrow down the network to a single module where things go wrong and then provide Module+Parameters and inputs. This would also limit how much you need to tell us about it.
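One way to do the narrowing, as a rough sketch: dump each module's output on both platforms for the same input, then look for the first module where the two disagree. The snippet below only does the comparison step, in plain Python. It assumes you have already collected the outputs into two dicts that map a module name to a flat list of floats, in execution order. The function name `first_divergence`, the module names, and the tolerances are made up for illustration and are not part of any library.

```python
import math


def first_divergence(reference, device, atol=1e-4, rtol=1e-3):
    """Return (module_name, max_abs_diff) for the first module whose
    outputs differ beyond tolerance, or None if all modules agree.

    `reference` and `device` are hypothetical dumps: dicts mapping a
    module name to a flat list of floats, inserted in execution order.
    """
    for name, ref_values in reference.items():
        dev_values = device.get(name)
        # A missing module or a different output size counts as a mismatch.
        if dev_values is None or len(dev_values) != len(ref_values):
            return name, math.inf
        worst = 0.0
        mismatch = False
        for r, d in zip(ref_values, dev_values):
            diff = abs(r - d)
            if math.isnan(diff):
                # NaN on either side is always treated as a divergence.
                return name, math.nan
            worst = max(worst, diff)
            if diff > atol + rtol * abs(r):
                mismatch = True
        if mismatch:
            return name, worst
    return None


# Made-up numbers: "conv1" agrees within tolerance, "deconv1" does not.
reference = {"conv1": [0.5, -1.25], "deconv1": [2.0, 3.0], "fc": [0.1]}
device = {"conv1": [0.50001, -1.25], "deconv1": [2.0, 7.5], "fc": [9.0]}
result = first_divergence(reference, device)
print(result)
```

The first module reported is the one worth isolating: everything after it is fed already-wrong inputs, so later mismatches (like "fc" in the made-up data) say nothing on their own.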