Exponential Moving Average in PyTorch, for weights and gradients

Do we need to apply exponential moving average to weights during training when we use Adam (or other optimizers)?

My EMA is defined as:

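class EMA(object):
    """Keeps an exponential moving average (the "shadow") of named tensors."""

    def __init__(self, mu):
        self.mu = mu        # decay rate: weight given to the previous average
        self.shadow = {}    # name -> current averaged value

    def register(self, name, val):
        self.shadow[name] = val.clone()

    def __call__(self, name, x):
        assert name in self.shadow
        new_average = (1.0 - self.mu) * x + self.mu * self.shadow[name]
        self.shadow[name] = new_average.clone()
        return new_average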
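
To make the update in `__call__` concrete, here is a small worked sketch of the same formula on plain floats instead of tensors. The helper name `ema_update` and the values `mu=0.9` and a constant input of `1.0` are only illustrative assumptions, not part of the code above.

```python
def ema_update(shadow, x, mu):
    """One EMA step: keep a fraction mu of the old average, blend in (1 - mu) of x."""
    return (1.0 - mu) * x + mu * shadow


shadow = 0.0      # starting average (illustrative)
history = []
for _ in range(3):
    # Feed the same value 1.0 three times and record the running average.
    shadow = ema_update(shadow, 1.0, mu=0.9)
    history.append(shadow)

# After n identical inputs of 1.0 the average equals 1 - mu**n,
# so it approaches the input only gradually when mu is close to 1.
print(history)
```

With a decay close to 1, the average lags well behind the latest value, which is worth keeping in mind when the result is written back into the weights.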

My training code looks like this (run for each batch):

        y1, y2 = answer_tpos[:, 0], answer_tpos[:, 1]
        loss1 = self.loss(p1, y1, size_average=True)
        loss2 = self.loss(p2, y2, size_average=True)
        loss = (loss1 + loss2) / 2

        # update learning rate
        if self.use_scheduler:
            # during warm up stage, use exponential warm up
            if self.step < self.lr_warm_up_num - 1:
            # after warm up stage, fix scheduler
            if self.step >= self.lr_warm_up_num - 1 and self.unused:
                self.optimizer.param_groups[0]['initial_lr'] = self.lr
                self.scheduler = optim.lr_scheduler.ExponentialLR(
                    self.optimizer, self.decay)
                for g in self.optimizer.param_groups:
                    g['lr'] = self.lr
                self.unused = False
            # print("Learning rate: {}".format(self.scheduler.get_lr()))
            print("Learning rate: {}".format(

        # exponential moving average
        if self.use_ema and self.ema is not None:
            print("Apply ema")
            for name, param in self.model.named_parameters():
                if param.requires_grad:
                    param.data = self.ema(name, param.data)

        # gradient clip
        if self.use_grad_clip:
                self.model.parameters(), self.grad_clip)

And the EMA object is created (in main) as:

# set optimizer and scheduler
parameters = filter(lambda p: p.requires_grad, model.parameters())
base_lr = 1.0
optimizer = optim.Adam(
    betas=(args.beta1, args.beta2),
cr = args.lr / math.log2(args.lr_warm_up_num)
scheduler = optim.lr_scheduler.LambdaLR(
    lr_lambda=lambda ee: cr * math.log2(ee + 1)
    if ee < args.lr_warm_up_num else args.lr)

# exponential moving average
ema = EMA(args.decay)
if args.use_ema:
    for name, param in model.named_parameters():
        if param.requires_grad:
            ema.register(name, param.data)

I referred to the following discussion.

My questions are:

  1. Is my implementation correct? After I applied EMA to the weights, my training performance is not good.
  2. It seems that some optimizers already use an exponential moving average of the gradients. If they use EMA for the gradients, do we still need to apply EMA to the weights after the optimizer step?

Please help! Thanks a lot!

  1. I’m not familiar with EMA, so I can’t check whether the implementation is correct.

  2. It feels like they’re independent, but I’m not sure either.
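
To make point 2 concrete, here is a minimal sketch (not a verified fix for the code above) of how the two averages are commonly kept apart. Adam keeps its moving averages of the gradients in its own state, while a weight EMA is usually held in a separate shadow copy that is updated after `optimizer.step()` and only swapped in for evaluation. The toy model, the random data, and `mu = 0.99` are assumptions for illustration.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
model = nn.Linear(4, 1)                        # stand-in model (assumption)
optimizer = optim.Adam(model.parameters(), lr=1e-2)
mu = 0.99                                      # hypothetical decay rate

# Shadow copy of the trainable weights; it is never written back during training.
shadow = {n: p.detach().clone()
          for n, p in model.named_parameters() if p.requires_grad}

x, y = torch.randn(32, 4), torch.randn(32, 1)  # toy data (assumption)

for step in range(5):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()   # Adam applies its own moving averages to the gradients here
    with torch.no_grad():
        for n, p in model.named_parameters():
            if p.requires_grad:
                # shadow = mu * shadow + (1 - mu) * current weight
                shadow[n].mul_(mu).add_(p, alpha=1.0 - mu)

# Evaluation: temporarily swap in the averaged weights, then restore the originals.
backup = {n: p.detach().clone() for n, p in model.named_parameters()}
with torch.no_grad():
    for n, p in model.named_parameters():
        p.copy_(shadow[n])
    ema_loss = nn.functional.mse_loss(model(x), y).item()
    for n, p in model.named_parameters():
        p.copy_(backup[n])
```

In this arrangement the optimizer always continues from its own latest weights. Assigning the averaged value back to `param.data` on every batch instead feeds the average into the next update, which may be one reason training looks worse, though that is a guess rather than a diagnosis.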

@yucoian, for future reference: I don’t think it’s helpful to tag me without context, and I don’t know much about research questions.

Hi there! I am struggling! My main Facebook account was hacked and has now been disabled, and that is just one of the minor problems. Somehow, two of my largest Facebook pages, which do a lot of business, were hacked into and completely taken over. I can still see them on the Internet, and I have reported them and posted things to try to get people to report them to Facebook, but I am seeing no results. I was removed as an admin on my Pages and I don’t even know how. Can you guys please help me? Thank you so much!

Hi @eplato,
as this is the PyTorch discussion board, I doubt anyone can help you here.
It would probably be a good idea to contact Facebook support.