How to apply exponential moving average decay for variables?

jef · December 6, 2017, 4:48pm

I am reading the following paper, which uses EMA decay for variables:
BI-DIRECTIONAL ATTENTION FLOW FOR MACHINE COMPREHENSION

During training, the moving averages of all weights of the model are
maintained with the exponential decay rate of 0.999.

The authors use TensorFlow, and I found the related EMA code:

github.com: allenai/bi-att-flow/blob/master/basic/model.py#L229

              for var in tf.get_collection("ema/scalar", scope=self.scope):
                  ema_var = ema.average(var)
                  tf.scalar_summary(ema_var.op.name, ema_var)
              for var in tf.get_collection("ema/vector", scope=self.scope):
                  ema_var = ema.average(var)
                  tf.histogram_summary(ema_var.op.name, ema_var)
          
              with tf.control_dependencies([ema_op]):
                  self.loss = tf.identity(self.loss)
          
          def _build_var_ema(self):
              self.var_ema = tf.train.ExponentialMovingAverage(self.config.var_decay)
              ema = self.var_ema
              ema_op = ema.apply(tf.trainable_variables())
              with tf.control_dependencies([ema_op]):
                  self.loss = tf.identity(self.loss)
          
          def get_loss(self):
              return self.loss
          
          def get_global_step(self):

In PyTorch, how do I apply EMA to Variables? In TensorFlow, there is the tf.train.ExponentialMovingAverage class.
https://www.tensorflow.org/versions/r0.12/api_docs/python/train/moving_averages

alexis-jacq · December 7, 2017, 10:33am

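import torch
import torch.nn as nn
from torch.autograd import Variable

class EMA(nn.Module):
    def __init__(self, mu):
        super(EMA, self).__init__()
        self.mu = mu

    def forward(self, x, last_average):
        # NOTE: the weights are swapped here (see the correction later in
        # this thread); mu should multiply last_average, not x.
        new_average = self.mu*x + (1-self.mu)*last_average
        return new_average

ema = EMA(0.999)
x = Variable(torch.rand(5),requires_grad=True)
average = Variable(torch.zeros(5),requires_grad=True)
average = ema(x, average)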

jef · December 7, 2017, 4:18pm

@alexis-jacq Thank you!
I updated your code as shown below, adding a shadow-copy mechanism that stores the last value as the running average.
Does this make sense as idiomatic PyTorch?

   class EMA(nn.Module):
       def __init__(self, mu):
           super(EMA, self).__init__()
           self.mu = mu
           self.shadow = {}

       def register(self, name, val):
           self.shadow[name] = val.clone()

       def forward(self, name, x):
           assert name in self.shadow
           new_average = self.mu * x + (1.0 - self.mu) * self.shadow[name]
           self.shadow[name] = new_average.clone()
           return new_average

   ema = EMA(0.999)
   for name, param in model.named_parameters():
       if param.requires_grad:
           ema.register(name, param.data)

   # in batch training loop
   for batch in batches:
       # ... forward pass and loss.backward() go here ...
       optimizer.step()
       for name, param in model.named_parameters():
           if param.requires_grad:
               param.data = ema(name, param.data)

alexis-jacq · December 8, 2017, 3:32pm

OK, I see. In that case you don’t need to inherit from nn.Module, since you don’t want the EMA correction of your parameters to be taken into account when computing the next gradient. So you could simply have something like:

class EMA():
    def __init__(self, mu):
        self.mu = mu
        self.shadow = {}

    def register(self, name, val):
        self.shadow[name] = val.clone()

    def __call__(self, name, x):
        assert name in self.shadow
        # NOTE: the weights are swapped here; see the correction in the next reply.
        new_average = self.mu * x + (1.0 - self.mu) * self.shadow[name]
        self.shadow[name] = new_average.clone()
        return new_average

ema = EMA(0.999)
for name, param in model.named_parameters():
    if param.requires_grad:
        ema.register(name, param.data)

# in batch training loop
for batch in batches:
    # ... forward pass and loss.backward() go here ...
    optimizer.step()
    for name, param in model.named_parameters():
        if param.requires_grad:
            param.data = ema(name, param.data)

ronghanghu · May 1, 2018, 3:14pm

As far as I understand, it should be

new_average = (1.0 - self.mu) * x + self.mu * self.shadow[name]

instead, since mu is 0.999. You want to multiply the running average by 0.999, not the input x.
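
To make the difference concrete, here is a small pure-Python sketch. The helper names `ema_update` and `ema_update_swapped` and the sample values are made up for illustration; only the two formulas come from the thread.

```python
def ema_update(average, x, decay):
    """One EMA step: keep `decay` of the old average, take (1 - decay) of x."""
    return decay * average + (1.0 - decay) * x


def ema_update_swapped(average, x, decay):
    """The swapped weighting from the earlier posts, shown for comparison."""
    return decay * x + (1.0 - decay) * average


samples = [1.0, 5.0, 1.0, 5.0]  # made-up noisy values
avg_ok = avg_swapped = samples[0]
for x in samples[1:]:
    avg_ok = ema_update(avg_ok, x, 0.999)
    avg_swapped = ema_update_swapped(avg_swapped, x, 0.999)

print(avg_ok)       # stays close to the starting value 1.0
print(avg_swapped)  # almost equal to the latest sample 5.0
```

With the swapped weights the "average" just follows the most recent value, so it does almost no smoothing.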

alexis-jacq · May 3, 2018, 11:26am

My mistake, you are right. Unfortunately, and I don’t know why, I can’t edit my previous post.

yucoian · November 17, 2018, 1:42pm

@smth
@apaszke
please help us! Thanks a lot!

Knurpsbram · April 29, 2019, 11:06am

I am confused as to how to apply EMA. Is the shadow model or the actual model used to process training items and compute gradients? If it’s the shadow model, then I don’t see how it’s different from using a smaller learning rate with the optimizer. Is it the case that the shadow model is only used at test time?

eason_long · June 10, 2019, 4:14am

That’s right! There are so many wrong implementations!

amsword · June 9, 2020, 12:05am

In the implementation above, the moving-averaged values are used for the next iteration (the last line assigns them back to param.data). Another possible approach is to only track the moving average, while the parameters in the network remain the ones produced by the optimizer. That is, only run ema(name, param.data), but do not assign the result back to param.data.

Just wondering which one is better, or which one is more commonly used?
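
For reference, the second option could look roughly like the sketch below. The class `ShadowEMA`, its method names, and the toy `nn.Linear` model with random data are all hypothetical and not taken from any post above. It also assumes the averaged weights are only wanted at evaluation time.

```python
import torch
import torch.nn as nn


class ShadowEMA:
    """Hypothetical helper: tracks averaged copies without touching the model."""

    def __init__(self, model, decay):
        self.decay = decay
        self.shadow = {name: p.detach().clone()
                       for name, p in model.named_parameters()
                       if p.requires_grad}

    @torch.no_grad()
    def update(self, model):
        for name, p in model.named_parameters():
            if p.requires_grad:
                # shadow = decay * shadow + (1 - decay) * param
                self.shadow[name].mul_(self.decay).add_(p, alpha=1.0 - self.decay)

    @torch.no_grad()
    def copy_to(self, model):
        # Overwrite the model's parameters with the averaged values.
        for name, p in model.named_parameters():
            if name in self.shadow:
                p.copy_(self.shadow[name])


torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
ema = ShadowEMA(model, decay=0.999)

for _ in range(5):  # toy training loop on random data
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    loss = ((model(x) - y) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema.update(model)  # track only; the optimizer's weights are left alone

trained = {name: p.detach().clone() for name, p in model.named_parameters()}
ema.copy_to(model)  # swap in the averaged weights for evaluation
```

To continue training after evaluating, copy `trained` back into the model first, so the optimizer keeps working on the raw weights.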

schow · November 8, 2021, 4:08am

Is there a reason why we are using:

self.shadow[name] = new_average.clone()

instead of

self.shadow[name] = new_average.detach().clone()

since we really don’t care about propagating the gradients to the shadow copies of the parameters?
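
One way to check this, as a minimal sketch with made-up values: in the loops above the tensors come from `param.data`, which carries no autograd history, so `.clone()` alone already gives a tensor that does not require grad. `.detach()` would start to matter if the parameter itself were passed in instead.

```python
import torch
import torch.nn as nn

param = nn.Parameter(torch.ones(3))

# Case 1: the thread's usage passes param.data, which has no autograd history.
from_data = 0.999 * param.data.clone() + 0.001 * param.data
print(from_data.requires_grad)                    # False

# Case 2: passing the parameter itself keeps the result attached to the graph.
from_param = 0.999 * param.data.clone() + 0.001 * param
print(from_param.clone().requires_grad)           # True
print(from_param.detach().clone().requires_grad)  # False
```

So `.detach().clone()` is harmless in either case and is the safer habit if the calling code might change.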