How to apply exponential moving average decay for variables?


#1

I am reading the following paper, and it uses EMA decay for the variables:
BI-DIRECTIONAL ATTENTION FLOW FOR MACHINE COMPREHENSION

During training, the moving averages of all weights of the model are
maintained with the exponential decay rate of 0.999.

They use TensorFlow, and I found the related EMA code.

In PyTorch, how do I apply EMA to Variables? In TensorFlow there is the tf.train.ExponentialMovingAverage class:
https://www.tensorflow.org/versions/r0.12/api_docs/python/train/moving_averages
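
For reference, that TensorFlow class keeps a shadow copy of each variable and, on every update, moves it slightly towards the current value: shadow = decay * shadow + (1 - decay) * variable. A minimal plain-Python sketch of just that update rule (the numbers are only illustrative):

    # Exponential-moving-average update rule with decay = 0.999:
    # the shadow keeps 99.9% of its old value and takes 0.1% of the new one.
    decay = 0.999
    shadow = 1.0                      # shadow copy of a single scalar weight
    for step in range(3):
        weight = 1.0 + 0.1 * step     # stand-in for the weight after a training step
        shadow = decay * shadow + (1 - decay) * weight
        print(shadow)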


(Alexis David Jacq) #2
    import torch
    import torch.nn as nn
    from torch.autograd import Variable

    class EMA(nn.Module):
        def __init__(self, mu):
            super(EMA, self).__init__()
            self.mu = mu  # weight given to the new value x

        def forward(self, x, last_average):
            new_average = self.mu * x + (1 - self.mu) * last_average
            return new_average

    ema = EMA(0.999)
    x = Variable(torch.rand(5), requires_grad=True)
    average = Variable(torch.zeros(5), requires_grad=True)
    average = ema(x, average)
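
Note that mu here is the weight on the new value, so EMA(0.999) makes the average track x almost immediately. To get the behaviour described in the paper, where the old average keeps a weight of 0.999, you would pass the complement instead, for example (the names below are just for illustration):

    # The paper's decay rate of 0.999 on the old average corresponds to
    # mu = 1 - 0.999 = 0.001 with this class.
    slow_ema = EMA(0.001)
    running = Variable(torch.zeros(5))
    for _ in range(10):
        sample = Variable(torch.rand(5))
        running = slow_ema(sample, running)  # drifts slowly towards the samples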

#3

@alexis-jacq Thank you!
I updated your code as below, adding a shadow-copy mechanism that stores the last value as the running average.
Does this make sense in PyTorch terms?

    class EMA(nn.Module):
        def __init__(self, mu):
            super(EMA, self).__init__()
            self.mu = mu          # weight given to the new value x
            self.shadow = {}      # shadow copies of the registered parameters

        def register(self, name, val):
            self.shadow[name] = val.clone()

        def forward(self, name, x):
            assert name in self.shadow
            new_average = self.mu * x + (1.0 - self.mu) * self.shadow[name]
            self.shadow[name] = new_average.clone()
            return new_average

    # model and optimizer come from the usual training setup
    ema = EMA(0.999)
    for name, param in model.named_parameters():
        if param.requires_grad:
            ema.register(name, param.data)

    # inside the batch training loop
    for batch in batches:
        ...
        optimizer.step()
        for name, param in model.named_parameters():
            if param.requires_grad:
                param.data = ema(name, param.data)
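
As a quick self-contained check, here is roughly how this wires up end to end; the linear model, loss and random data below are only placeholders to exercise the loop:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.autograd import Variable

    model = nn.Linear(5, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    ema = EMA(0.999)
    for name, param in model.named_parameters():
        if param.requires_grad:
            ema.register(name, param.data)

    for step in range(5):
        inputs = Variable(torch.rand(8, 5))
        targets = Variable(torch.rand(8, 1))
        optimizer.zero_grad()
        loss = F.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
        # update the running averages and write them back into the model
        for name, param in model.named_parameters():
            if param.requires_grad:
                param.data = ema(name, param.data)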

(Alexis David Jacq) #4

Ok, I see. In that case you don't need to inherit from nn.Module, since you don't want the EMA correction of your parameters to be taken into account when computing the next gradient. So you could simply have something like:

    class EMA:
        def __init__(self, mu):
            self.mu = mu
            self.shadow = {}

        def register(self, name, val):
            self.shadow[name] = val.clone()

        def __call__(self, name, x):
            assert name in self.shadow
            new_average = self.mu * x + (1.0 - self.mu) * self.shadow[name]
            self.shadow[name] = new_average.clone()
            return new_average

    ema = EMA(0.999)
    for name, param in model.named_parameters():
        if param.requires_grad:
            ema.register(name, param.data)

    # inside the batch training loop
    for batch in batches:
        ...
        optimizer.step()
        for name, param in model.named_parameters():
            if param.requires_grad:
                param.data = ema(name, param.data)
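
One thing to keep in mind: this loop overwrites the live parameters with their average after every step, so the optimizer then keeps training on the smoothed weights. If you would rather train on the raw weights and only swap the averaged ones in for evaluation, a variant along these lines could work (copy_to and restore are just names chosen here, not an existing PyTorch API):

    class ShadowEMA:
        def __init__(self, mu):
            self.mu = mu
            self.shadow = {}   # running averages
            self.backup = {}   # raw weights stashed during evaluation

        def register(self, name, val):
            self.shadow[name] = val.clone()

        def update(self, name, x):
            # same update rule as above, but the live parameter is left untouched
            assert name in self.shadow
            self.shadow[name] = self.mu * x + (1.0 - self.mu) * self.shadow[name]

        def copy_to(self, model):
            # stash the raw weights, then load the averaged ones for evaluation
            for name, param in model.named_parameters():
                if param.requires_grad and name in self.shadow:
                    self.backup[name] = param.data.clone()
                    param.data = self.shadow[name].clone()

        def restore(self, model):
            # put the raw training weights back after evaluation
            for name, param in model.named_parameters():
                if param.requires_grad and name in self.backup:
                    param.data = self.backup[name].clone()

Usage would be to call update(name, param.data) after each optimizer.step(), then copy_to(model) before validation and restore(model) afterwards.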