In my AMD GPU testing, I replaced the Welford reduction in the batchnorm operator with the naive (single-pass) formula when generating the Triton DSL, and got better performance in some cases. Welford has to maintain an extra iteration-count variable, so the naive version uses fewer registers per kernel, which allows better parallelism (higher occupancy). I am confused as to why Inductor defaults to the Welford algorithm. The naive formula is:
Var(x) = E[x^2] - (E[x])^2
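To make the comparison concrete, here is a minimal Python sketch (not the actual Inductor/Triton codegen, just an illustration of the two reductions): the naive version accumulates only sum and sum-of-squares, while Welford additionally tracks a running count and mean. The usual argument for Welford is numerical stability, since E[x^2] - (E[x])^2 can suffer catastrophic cancellation when the mean is large relative to the variance.

```python
import random
import statistics

def naive_var(xs):
    # Single-pass "naive" formula: Var(x) = E[x^2] - (E[x])^2.
    # Needs only two accumulators (sum and sum of squares); the
    # element count is known up front, so no per-step counter.
    s = 0.0
    sq = 0.0
    for v in xs:
        s += v
        sq += v * v
    n = len(xs)
    mean = s / n
    return sq / n - mean * mean

def welford_var(xs):
    # Welford's online algorithm: maintains three running values
    # (count, mean, M2), i.e. one more live register than naive.
    count = 0
    mean = 0.0
    m2 = 0.0
    for v in xs:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    return m2 / count  # population variance

random.seed(0)
# Data with a large mean relative to its spread, the regime where
# the naive formula starts losing precision to cancellation.
xs = [random.gauss(1000.0, 1.0) for _ in range(10_000)]
ref = statistics.pvariance(xs)
print(abs(naive_var(xs) - ref), abs(welford_var(xs) - ref))
```

In float64 both results stay close to the reference here, but in float32 (the typical GPU accumulator width) the cancellation in the naive formula becomes much more visible, which is presumably part of the default's rationale.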