How to apply exponential moving average decay for variables?

I am reading the following paper, which uses EMA decay for variables:
BI-DIRECTIONAL ATTENTION FLOW FOR MACHINE COMPREHENSION

During training, the moving averages of all weights of the model are
maintained with the exponential decay rate of 0.999.

They use TensorFlow, and I found the related EMA code.

In PyTorch, how do I apply EMA to Variables? TensorFlow has the tf.train.ExponentialMovingAverage class:
https://www.tensorflow.org/versions/r0.12/api_docs/python/train/moving_averages

import torch
import torch.nn as nn
from torch.autograd import Variable

class EMA(nn.Module):
    def __init__(self, mu):
        super(EMA, self).__init__()
        self.mu = mu

    def forward(self, x, last_average):
        new_average = self.mu*x + (1-self.mu)*last_average
        return new_average

ema = EMA(0.999)
x = Variable(torch.rand(5),requires_grad=True)
average = Variable(torch.zeros(5),requires_grad=True)
average = ema(x, average)

@alexis-jacq Thank you!
I updated your code as shown below, adding a shadow-copy mechanism that stores the last value as the running average.
Does this make sense as idiomatic PyTorch?

class EMA(nn.Module):
    def __init__(self, mu):
        super(EMA, self).__init__()
        self.mu = mu
        self.shadow = {}

    def register(self, name, val):
        self.shadow[name] = val.clone()

    def forward(self, name, x):
        assert name in self.shadow
        new_average = self.mu * x + (1.0 - self.mu) * self.shadow[name]
        self.shadow[name] = new_average.clone()
        return new_average

ema = EMA(0.999)
for name, param in model.named_parameters():
    if param.requires_grad:
        ema.register(name, param.data)

# in batch training loop
for batch in batches:
    optimizer.step()
    for name, param in model.named_parameters():
        if param.requires_grad:
            param.data = ema(name, param.data)

OK, I see. In that case you don't need to inherit from nn.Module, since you don't want the EMA correction of your parameters to be taken into account when computing the next gradient. So you could simply have something like:

class EMA():
    def __init__(self, mu):
        self.mu = mu
        self.shadow = {}

    def register(self, name, val):
        self.shadow[name] = val.clone()

    def __call__(self, name, x):
        assert name in self.shadow
        new_average = self.mu * x + (1.0 - self.mu) * self.shadow[name]
        self.shadow[name] = new_average.clone()
        return new_average

ema = EMA(0.999)
for name, param in model.named_parameters():
    if param.requires_grad:
        ema.register(name, param.data)

# in batch training loop
for batch in batches:
    optimizer.step()
    for name, param in model.named_parameters():
        if param.requires_grad:
            param.data = ema(name, param.data)

As far as I understand, it should be

new_average = (1.0 - self.mu) * x + self.mu * self.shadow[name]

instead, since mu is 0.999. You want to multiply the running average by 0.999, not the input x.
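For reference, here is a minimal sketch of the class from the earlier post with that weighting applied. It keeps the names used in the thread (`mu`, `shadow`, `register`); the toy tensors at the bottom are made up purely to show the arithmetic, so treat it as an illustration rather than a drop-in implementation.

```python
import torch


class EMA:
    """Exponential moving average of named tensors (illustrative sketch)."""

    def __init__(self, mu):
        self.mu = mu
        self.shadow = {}

    def register(self, name, val):
        self.shadow[name] = val.clone()

    def __call__(self, name, x):
        assert name in self.shadow
        # mu (e.g. 0.999) weights the running average; the new value gets 1 - mu.
        new_average = (1.0 - self.mu) * x + self.mu * self.shadow[name]
        self.shadow[name] = new_average.clone()
        return new_average


# Toy check: the average starts at 0 and sees a single value of 1.
ema = EMA(0.999)
ema.register("w", torch.zeros(3))
avg = ema("w", torch.ones(3))
print(avg)  # each entry is roughly 0.001
```

With `mu = 0.999`, one update moves the average only 0.1% of the way toward the new value, which matches the slow-moving behaviour the paper's decay rate implies.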

My mistake, you are right. Unfortunately, I can't edit my previous post, and I don't know why.

@smth
@apaszke
please help us! Thanks a lot!

I am confused as to how to apply EMA. Is the shadow model or the actual model used to process training items and compute gradients? If it's the shadow model, then I don't see how that differs from using a smaller learning rate with the optimizer. Or is the shadow model only used at test time?

That's right! There are so many incorrect implementations!

In the implementation above, the moving-averaged results are written back to the parameters and used in the next iterations (see the last line of the training loop). Another option is to only track the moving average while the parameters in the network remain whatever the optimizer produced. That is, run ema(name, param.data) but do not assign the result back to param.data.

Just wondering which one is better, or which one is the commonly used one?
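To make the second option concrete, here is a rough sketch of "track only, no write-back", with the averaged weights swapped in just for evaluation and the trained weights restored afterwards. The helper names `update_shadow` and `swap_in`, the `nn.Linear` stand-in model and the random data are all hypothetical, chosen only so the snippet runs on its own.

```python
import torch
import torch.nn as nn


def update_shadow(model, shadow, mu):
    """Move each shadow tensor toward the current weights; the model is not modified."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in shadow:
                shadow[name].mul_(mu).add_(param, alpha=1.0 - mu)


def swap_in(model, tensors):
    """Copy `tensors` into the model and return a backup of the previous weights."""
    backup = {}
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in tensors:
                backup[name] = param.detach().clone()
                param.copy_(tensors[name])
    return backup


model = nn.Linear(4, 2)  # hypothetical stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
shadow = {n: p.detach().clone() for n, p in model.named_parameters() if p.requires_grad}

for _ in range(5):  # stand-in for the batch loop, using random data
    x, y = torch.randn(8, 4), torch.randn(8, 2)
    optimizer.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    optimizer.step()
    update_shadow(model, shadow, mu=0.999)  # track only, no write-back

trained = swap_in(model, shadow)  # evaluate with the averaged weights ...
# ... run validation / test here ...
swap_in(model, trained)           # ... then restore the optimizer's weights
```

In this variant the gradients are always computed on the optimizer's own weights, and the shadow copy only matters at evaluation time.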

Is there a reason why we are using:

self.shadow[name] = new_average.clone()

instead of

self.shadow[name] = new_average.detach().clone()

since we really don't care about propagating gradients to the shadow copies of the parameters?
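A quick check may help here. In the code earlier in the thread the EMA is fed `param.data`, which is already outside the autograd graph, so `.detach()` would make no difference in that setup. It should only matter if the parameter itself were passed in. The `param` tensor below is a made-up example.

```python
import torch

param = torch.nn.Parameter(torch.rand(5))  # made-up example parameter

from_data = param.data.clone()     # what the code above registers and updates
from_param = param.clone()         # cloning the parameter itself stays in the graph
detached = param.detach().clone()  # explicit detach, as proposed

print(from_data.requires_grad)   # False
print(from_param.requires_grad)  # True
print(detached.requires_grad)    # False
```

So adding `.detach()` is harmless and makes the intent explicit, but with `param.data` as input the shadow copies carry no gradient history either way.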