Why does Inductor's Triton reduction codegen use the Welford algorithm instead of the naive one?

In my testing on AMD GPUs, I replaced the Welford algorithm in the batchnorm operator with the naive algorithm when generating Triton DSL, and got better performance in some cases. I also found that Welford has to maintain an additional iteration-count variable, whereas the naive algorithm uses fewer registers per kernel, which helps achieve better parallelism. Why does Inductor default to the Welford algorithm? The formula for the naive algorithm is as follows:

D(x) = E(x^2) - (E(x))^2
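
To make the comparison concrete, here is a minimal sketch of the two approaches in plain Python. It is not the code Inductor generates; the function names `naive_variance` and `welford_variance` are hypothetical and only illustrate which running state each algorithm carries through the loop.

```python
def naive_variance(xs):
    """Population variance via D(x) = E(x^2) - (E(x))^2."""
    n = len(xs)
    total = 0.0      # running sum of x
    total_sq = 0.0   # running sum of x^2
    for x in xs:
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean


def welford_variance(xs):
    """Population variance via Welford's online update."""
    count = 0   # the extra per-iteration counter Welford has to carry
    mean = 0.0  # running mean
    m2 = 0.0    # running sum of squared deviations from the mean
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return m2 / count


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(naive_variance(data), welford_variance(data))
```

In this sketch the naive loop carries two accumulators and needs the element count only once at the end, while the Welford loop carries three values and performs a division on every iteration.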

@peterbell10 Could you please take a look at this? :grinning: