I am reading the following paper, which uses exponential moving average (EMA) decay for its variables:
BI-DIRECTIONAL ATTENTION FLOW FOR MACHINE COMPREHENSION
During training, the moving averages of all weights of the model are
maintained with the exponential decay rate of 0.999.
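
To make the quoted sentence concrete, here is a minimal sketch of the update as I understand it: after each training step, every averaged value is replaced by decay * average + (1 - decay) * current value. This is plain Python, not the paper's code and not any library's API. The `EMA` class and its `register` and `update` methods are hypothetical names chosen for illustration, and a single float stands in for a weight tensor (the same arithmetic would apply elementwise to a tensor).

```python
class EMA:
    """Keeps an exponential moving average (a "shadow" copy) of named values.

    Hypothetical helper for illustration only; not a TensorFlow or PyTorch API.
    """

    def __init__(self, decay=0.999):
        self.decay = decay
        self.shadow = {}  # name -> current moving average

    def register(self, name, value):
        # Start the moving average at the initial value of the weight.
        self.shadow[name] = value

    def update(self, name, value):
        # new_average = decay * old_average + (1 - decay) * current_value
        new_average = self.decay * self.shadow[name] + (1.0 - self.decay) * value
        self.shadow[name] = new_average
        return new_average


# Usage: a float stands in for one weight of the model.
ema = EMA(decay=0.999)
ema.register("w", 1.0)           # weight before the training step
averaged = ema.update("w", 2.0)  # weight after the step is 2.0
print(averaged)
```

With a decay of 0.999 the average moves only 0.1% of the way toward the new value on each step, so it changes much more slowly than the raw weight.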
They use TensorFlow, and I found the related EMA code.
In PyTorch, how do I apply EMA to Variables? In TensorFlow, there is