# Exponential Moving Average in PyTorch, for weights and gradients

Do we need to apply an exponential moving average (EMA) to the weights during training when we use Adam (or another optimizer)?

My EMA is defined as:

```python
class EMA(object):

    def __init__(self, mu):
        self.mu = mu
        self.shadow = {}  # name -> running average of that parameter

    def register(self, name, val):
        self.shadow[name] = val.clone()

    def __call__(self, name, x):
        new_average = (1.0 - self.mu) * x + self.mu * self.shadow[name]
        self.shadow[name] = new_average.clone()
        return new_average
```
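For reference, here is a minimal self-contained sketch of how such a weight-EMA helper is meant to behave (the `mu` and tensor values below are made up for illustration; `register` is assumed to store a copy of each tensor in a shadow dict):

```python
import torch

class EMA:
    """Minimal sketch of a weight-EMA helper."""

    def __init__(self, mu):
        self.mu = mu      # decay rate, e.g. 0.999 in practice
        self.shadow = {}  # name -> running average tensor

    def register(self, name, val):
        self.shadow[name] = val.clone()

    def __call__(self, name, x):
        # new_average = (1 - mu) * current + mu * previous_average
        new_average = (1.0 - self.mu) * x + self.mu * self.shadow[name]
        self.shadow[name] = new_average.clone()
        return new_average

ema = EMA(mu=0.5)
ema.register('w', torch.tensor([0.0]))
print(ema('w', torch.tensor([1.0])))  # tensor([0.5000])
print(ema('w', torch.tensor([1.0])))  # tensor([0.7500])
```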

My training code looks like this (run for each batch):

```python
y1, y2 = answer_tpos[:, 0], answer_tpos[:, 1]
loss1 = self.loss(p1, y1, size_average=True)
loss2 = self.loss(p2, y2, size_average=True)
loss = (loss1 + loss2) / 2
loss.backward()
self.optimizer.step()

# update learning rate
if self.use_scheduler:
    # during the warm-up stage, use exponential warm-up
    if self.step < self.lr_warm_up_num - 1:
        self.scheduler.step()
    # after the warm-up stage, switch to a fixed exponential-decay scheduler
    if self.step >= self.lr_warm_up_num - 1 and self.unused:
        self.optimizer.param_groups[0]['initial_lr'] = self.lr
        self.scheduler = optim.lr_scheduler.ExponentialLR(
            self.optimizer, self.decay)
        for g in self.optimizer.param_groups:
            g['lr'] = self.lr
        self.unused = False
    # print("Learning rate: {}".format(self.scheduler.get_lr()))
    print("Learning rate: {}".format(
        self.optimizer.param_groups[0]['lr']))

# exponential moving average
if self.use_ema and self.ema is not None:
    print("Apply ema")
    for name, param in self.model.named_parameters():
        param.data = self.ema(name, param.data)
```

And the EMA is set up (in `main`) as:

```python
# set optimizer and scheduler
parameters = filter(lambda p: p.requires_grad, model.parameters())
base_lr = 1.0
optimizer = optim.Adam(
    params=parameters,
    lr=base_lr,
    betas=(args.beta1, args.beta2),
    eps=1e-7,
    weight_decay=3e-7)
cr = args.lr / math.log2(args.lr_warm_up_num)
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda ee: cr * math.log2(ee + 1)
    if ee < args.lr_warm_up_num else args.lr)

# exponential moving average
ema = EMA(args.decay)
if args.use_ema:
    for name, param in model.named_parameters():
        ema.register(name, param.data)
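To sanity-check the warm-up schedule, the lambda can be evaluated on its own. Since `base_lr` is 1.0 and `LambdaLR` multiplies the base learning rate by the lambda's value, the effective learning rate equals the lambda directly. A sketch with made-up stand-ins for `args.lr` and `args.lr_warm_up_num`:

```python
import math

lr = 0.001             # stand-in for args.lr (assumed value)
lr_warm_up_num = 1000  # stand-in for args.lr_warm_up_num (assumed value)
cr = lr / math.log2(lr_warm_up_num)

def lr_lambda(ee):
    # same lambda as above: log2 warm-up, then flat at lr
    return cr * math.log2(ee + 1) if ee < lr_warm_up_num else lr

# learning rate ramps from 0 at step 0 up to ~lr at the end of warm-up
for step in [0, 1, 10, 100, 999, 1000]:
    print(step, lr_lambda(step))
```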

I referred to the following discussion.

My questions are:

1. Is my implementation correct? It seems that after I apply EMA to the weights, my training performance gets worse.
2. It seems that some optimizers already keep an exponential moving average of the gradients. If the optimizer already does that, do we still need to apply EMA to the weights after the optimizer step?
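For context on question 2: Adam's internal moving averages track the *gradients* (its first and second moments), not the weights, so they are a different quantity from a weight EMA. A simplified sketch of the distinction, using made-up scalar values and omitting Adam's bias correction for brevity:

```python
# Adam's internal EMAs track gradients (simplified, per-parameter scalars):
beta1, beta2, eps = 0.9, 0.999, 1e-7
m, v = 0.0, 0.0     # first / second moment estimates
w, lr = 1.0, 0.01
grad = 0.2          # made-up gradient value

m = beta1 * m + (1 - beta1) * grad        # EMA of gradients
v = beta2 * v + (1 - beta2) * grad ** 2   # EMA of squared gradients
w -= lr * m / (v ** 0.5 + eps)            # updates the *current* weights

# A weight EMA is a separate average kept outside the optimizer:
mu = 0.999
shadow_w = 1.0                            # running average of w
shadow_w = mu * shadow_w + (1 - mu) * w   # EMA of weights
```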

@smth
@apaszke