@alexis-jacq Thank you!
I updated your code as below, adding a shadow-copy mechanism that stores the last value as the running average. Does it make sense in the PyTorch manner?
```python
import torch.nn as nn

class EMA(nn.Module):
    def __init__(self, mu):
        super(EMA, self).__init__()
        self.mu = mu
        self.shadow = {}  # shadow copies holding the running averages

    def register(self, name, val):
        self.shadow[name] = val.clone()

    def forward(self, name, x):
        assert name in self.shadow
        new_average = self.mu * x + (1.0 - self.mu) * self.shadow[name]
        self.shadow[name] = new_average.clone()
        return new_average

ema = EMA(0.999)
for name, param in model.named_parameters():
    if param.requires_grad:
        ema.register(name, param.data)

# in the batch training loop
# for batch in batches:
optimizer.step()
for name, param in model.named_parameters():
    if param.requires_grad:
        param.data = ema(name, param.data)
```
Ok, I see. In that case you don't need to inherit from nn.Module, since you don't want the EMA correction of your parameters to be taken into account when computing the next gradient. So you could simply have something like:
```python
class EMA():
    def __init__(self, mu):
        self.mu = mu
        self.shadow = {}

    def register(self, name, val):
        self.shadow[name] = val.clone()

    def __call__(self, name, x):
        assert name in self.shadow
        new_average = self.mu * x + (1.0 - self.mu) * self.shadow[name]
        self.shadow[name] = new_average.clone()
        return new_average

ema = EMA(0.999)
for name, param in model.named_parameters():
    if param.requires_grad:
        ema.register(name, param.data)

# in the batch training loop
# for batch in batches:
optimizer.step()
for name, param in model.named_parameters():
    if param.requires_grad:
        param.data = ema(name, param.data)
```
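As a side note, on recent PyTorch versions the same update can be written without assigning to `param.data`, by running it under `torch.no_grad()` so the EMA correction never enters the autograd graph. A minimal sketch, assuming the `ema` object defined above:

```python
import torch

# after optimizer.step(), write the averaged values back in place
with torch.no_grad():
    for name, param in model.named_parameters():
        if param.requires_grad:
            # copy_ overwrites the parameter without recording autograd history
            param.copy_(ema(name, param.detach()))
```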
I am confused as to how to apply EMA. Is the shadow model or the actual model used to process training items and compute gradients? If it's the shadow model, then I don't see how it's different from using a smaller learning rate with the optimizer. Is it the case that the shadow model is only used at test time?
In the implementation above, the moving-averaged results are used for the next iterations (per the last sentence). Another potential solution is to only track the moving average, while the parameters in the network remain whatever the optimizer produced: that is, run `ema(name, param.data)` but do not assign the result back to `param.data`.
Just wondering which one is better, or which one is more commonly used?
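For concreteness, a minimal sketch of that second option (only tracking the average, and swapping the averaged weights in at evaluation time); the `update`/`apply_shadow`/`restore` helper names are hypothetical, not from any library:

```python
class EMA:
    def __init__(self, mu):
        self.mu = mu
        self.shadow = {}  # running averages
        self.backup = {}  # optimizer weights, saved while evaluating

    def register(self, name, val):
        self.shadow[name] = val.clone()

    def update(self, model):
        # track the average only; the model keeps the optimizer's weights
        for name, param in model.named_parameters():
            if param.requires_grad:
                new_average = self.mu * param.data + (1.0 - self.mu) * self.shadow[name]
                self.shadow[name] = new_average.clone()

    def apply_shadow(self, model):
        # swap the averaged weights in, e.g. before validation or testing
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.backup[name] = param.data.clone()
                param.data = self.shadow[name].clone()

    def restore(self, model):
        # put the optimizer's weights back before resuming training
        for name, param in model.named_parameters():
            if param.requires_grad:
                param.data = self.backup[name]
        self.backup = {}
```

Usage, after registering the initial parameters as in the snippets above:

```python
ema.update(model)        # every step, right after optimizer.step()
ema.apply_shadow(model)  # before evaluation
# ... evaluate model ...
ema.restore(model)       # before the next training step
```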