I am trying to understand the EMA feature in this training script from the pytorch/vision repository. Specifically, this part:
if args.model_ema:
    # Decay adjustment that aims to keep the decay independent from other hyper-parameters originally proposed at:
    # https://github.com/facebookresearch/pycls/blob/f8cd9627/pycls/core/net.py#L123
    #
    # total_ema_updates = (Dataset_size / n_GPUs) * epochs / (batch_size_per_gpu * EMA_steps)
    # We consider constant = Dataset_size for a given dataset/setup and ommit it. Thus:
    # adjust = 1 / total_ema_updates ~= n_GPUs * batch_size_per_gpu * EMA_steps / epochs
    adjust = args.world_size * args.batch_size * args.model_ema_steps / args.epochs
    alpha = 1.0 - args.model_ema_decay
    alpha = min(1.0, alpha * adjust)
    model_ema = utils.ExponentialMovingAverage(model_without_ddp, device=device, decay=1.0 - alpha)
In this code, the dataset size is omitted because it is treated as a constant. However, I can't see why it is safe to omit it even if it is constant for a given setup. This adjustment may improve results on ImageNet because the dataset is very large. But for a small dataset, say 10 thousand images, and a small number of epochs, say 100, I think it would be harmful because the EMA will not average over all the models.
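To make my concern concrete, here is a rough sketch of the arithmetic. It applies the same adjustment as the snippet above and compares the EMA's approximate averaging horizon (1 / alpha, in EMA updates) with the total number of EMA updates from the formula in the comment. The function name `ema_settings` and all hyper-parameter values below are hypothetical numbers I picked for illustration, not settings taken from the repository.

```python
def ema_settings(dataset_size, world_size, batch_size, ema_steps, epochs, ema_decay):
    """Return (alpha, total_ema_updates, horizon) for one hypothetical setup."""
    # Same adjustment as in the training script snippet (dataset size omitted).
    adjust = world_size * batch_size * ema_steps / epochs
    alpha = min(1.0, (1.0 - ema_decay) * adjust)
    # Total number of EMA updates, using the formula from the code comment.
    total_updates = (dataset_size / world_size) * epochs / (batch_size * ema_steps)
    # Rough averaging horizon of an EMA with weight alpha, in EMA updates.
    horizon = 1.0 / alpha
    return alpha, total_updates, horizon


# Hypothetical large setup (ImageNet-sized dataset, 8 GPUs).
alpha_large, total_large, horizon_large = ema_settings(
    dataset_size=1_281_167, world_size=8, batch_size=128,
    ema_steps=32, epochs=600, ema_decay=0.99998,
)

# Hypothetical small setup (10 thousand images, 1 GPU, 100 epochs).
alpha_small, total_small, horizon_small = ema_settings(
    dataset_size=10_000, world_size=1, batch_size=32,
    ema_steps=32, epochs=100, ema_decay=0.99998,
)

for name, total, horizon in [
    ("large", total_large, horizon_large),
    ("small", total_small, horizon_small),
]:
    # A ratio above 1 means the horizon is longer than the whole training run.
    print(f"{name}: total_updates={total:.1f} horizon={horizon:.1f} "
          f"ratio={horizon / total:.3f}")
```

If I have the algebra right, alpha * total_ema_updates simplifies to (1 - model_ema_decay) * Dataset_size whenever the min(1.0, ...) clamp is inactive, so the ratio between the horizon and the length of training depends only on the dataset size and the base decay. With the same base decay, the horizon in the small setup comes out longer than the entire run, while in the large setup it is a small fraction of it.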
So my question is: is there any other reason for omitting the dataset size, or is it just for convenience?