Do we need to apply an exponential moving average (EMA) to the weights during training when we already use Adam (or another optimizer)?

My EMA is defined as:

```
class EMA(object):
    def __init__(self, mu):
        self.mu = mu        # decay rate of the shadow average
        self.shadow = {}    # name -> smoothed copy of the parameter

    def register(self, name, val):
        self.shadow[name] = val.clone()

    def __call__(self, name, x):
        assert name in self.shadow
        new_average = (1.0 - self.mu) * x + self.mu * self.shadow[name]
        self.shadow[name] = new_average.clone()
        return new_average
```
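
To make the update rule concrete, here is how I exercise the class on a single tensor (a toy sketch; the parameter name and values are made up):

```
import torch

ema = EMA(mu=0.9999)

w = torch.zeros(3)          # stand-in for one model parameter
ema.register("w", w)

for step in range(5):
    w = w + 1.0             # pretend the optimizer just updated the weight
    smoothed = ema("w", w)  # shadow = mu * shadow + (1 - mu) * w
    print(step, smoothed)
```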

My per-batch training code looks like this:

```
y1, y2 = answer_tpos[:, 0], answer_tpos[:, 1]
loss1 = self.loss(p1, y1, size_average=True)
loss2 = self.loss(p2, y2, size_average=True)
loss = (loss1 + loss2) / 2
loss.backward()
self.optimizer.step()

# update learning rate
if self.use_scheduler:
    # during warm-up stage, use exponential warm-up
    if self.step < self.lr_warm_up_num - 1:
        self.scheduler.step()
    # after warm-up stage, switch to a fixed exponential-decay scheduler
    if self.step >= self.lr_warm_up_num - 1 and self.unused:
        self.optimizer.param_groups[0]['initial_lr'] = self.lr
        self.scheduler = optim.lr_scheduler.ExponentialLR(
            self.optimizer, self.decay)
        for g in self.optimizer.param_groups:
            g['lr'] = self.lr
        self.unused = False
    # print("Learning rate: {}".format(self.scheduler.get_lr()))
    print("Learning rate: {}".format(
        self.optimizer.param_groups[0]['lr']))

# exponential moving average
if self.use_ema and self.ema is not None:
    print("Apply ema")
    for name, param in self.model.named_parameters():
        if param.requires_grad:
            param.data = self.ema(name, param.data)

# gradient clipping
if self.use_grad_clip:
    torch.nn.utils.clip_grad_norm_(
        self.model.parameters(), self.grad_clip)
```
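
For comparison, other implementations I have seen do not overwrite the training weights on every step; they only copy the shadow weights into the model for evaluation and restore the raw weights afterwards, roughly like this (a sketch; `evaluate()` is a placeholder for my own eval routine):

```
# back up raw weights, evaluate with the EMA weights, then restore
backup = {}
for name, param in self.model.named_parameters():
    if param.requires_grad:
        backup[name] = param.data.clone()
        param.data = self.ema.shadow[name].clone()

evaluate(self.model)  # placeholder for the actual evaluation code

for name, param in self.model.named_parameters():
    if param.requires_grad:
        param.data = backup[name]
```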

And the EMA is instantiated (in main, together with the optimizer and scheduler) as:

```
# set optimizer and scheduler
parameters = filter(lambda p: p.requires_grad, model.parameters())
base_lr = 1.0
optimizer = optim.Adam(
    params=parameters,
    lr=base_lr,
    betas=(args.beta1, args.beta2),
    eps=1e-7,
    weight_decay=3e-7)
cr = args.lr / math.log2(args.lr_warm_up_num)
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda ee: cr * math.log2(ee + 1)
    if ee < args.lr_warm_up_num else args.lr)

# exponential moving average
ema = EMA(args.decay)
if args.use_ema:
    for name, param in model.named_parameters():
        if param.requires_grad:
            ema.register(name, param.data)
```
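
Since `base_lr` is 1.0, the effective learning rate is just the lambda value at each scheduler step. To check the warm-up I print a few values (the numbers below are assumptions standing in for `args.lr` and `args.lr_warm_up_num`):

```
import math

target_lr = 0.001        # stands in for args.lr
lr_warm_up_num = 1000    # stands in for args.lr_warm_up_num
cr = target_lr / math.log2(lr_warm_up_num)

def lr_lambda(ee):
    return cr * math.log2(ee + 1) if ee < lr_warm_up_num else target_lr

for ee in [0, 1, 10, 100, 999, 1000]:
    print(ee, lr_lambda(ee))
```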

I referred to the following discussion.

My questions are:

- Is my implementation correct? After I apply EMA to the weights as above, my training performance gets worse.
- It seems that some optimizers already use an exponential moving average on the gradients, as sketched below. If the optimizer already applies EMA to the gradients, do we still need to apply EMA to the weights after the optimizer step?
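
For reference on the second point, my understanding is that Adam keeps exponential moving averages of the gradient and of the squared gradient (the first and second moments), not of the weights themselves. A simplified single-parameter Adam step looks roughly like this (a sketch with illustrative hyperparameter values):

```
import torch

beta1, beta2, lr, eps = 0.8, 0.999, 1e-3, 1e-7   # illustrative values
theta = torch.randn(3)             # parameter
m = torch.zeros_like(theta)        # EMA of gradients (first moment)
v = torch.zeros_like(theta)        # EMA of squared gradients (second moment)

for t in range(1, 6):
    g = torch.randn_like(theta)    # stand-in for the gradient of the loss
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)
```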