In the implementation, the moving-averaged results are used for the next iterations (last sentence). Another potential solution is to only track the moving average, while the parameters in the network remain the results from the optimizer. That is, only run ema(name, param.data), but do not assign the result back to param.data.
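To make the two options concrete, here is a minimal sketch in plain Python. It is not the implementation from the thread: the `EMA` class, the `train` function, the `assign_back` flag, and the toy loss `(w - 1)^2` are all hypothetical, and plain floats stand in for `param.data`.

```python
class EMA:
    """Hypothetical tracker: shadow = decay * shadow + (1 - decay) * value."""

    def __init__(self, decay):
        self.decay = decay
        self.shadow = {}

    def register(self, name, value):
        # Initialise the running average with the starting parameter value.
        self.shadow[name] = value

    def __call__(self, name, value):
        new_avg = self.decay * self.shadow[name] + (1.0 - self.decay) * value
        self.shadow[name] = new_avg
        return new_avg


def train(assign_back, steps=3, lr=0.1, decay=0.5):
    # Toy problem (assumption): minimise (w - 1)^2 for one scalar parameter.
    params = {"w": 0.0}
    ema = EMA(decay)
    ema.register("w", params["w"])
    for _ in range(steps):
        grad = 2.0 * (params["w"] - 1.0)
        params["w"] -= lr * grad           # stands in for optimizer.step()
        avg = ema("w", params["w"])        # stands in for ema(name, param.data)
        if assign_back:
            # Option 1: the averaged value is used in the next iteration.
            params["w"] = avg
        # Option 2 (assign_back=False): only track the average.
    return params["w"], ema.shadow["w"]
```

Calling `train(assign_back=True)` returns a parameter equal to its running average, whereas `train(assign_back=False)` returns the plain optimizer result together with a separately tracked average that could be swapped in at evaluation time.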
Just wondering which one is better, or which one is more commonly used?