Why don't we add AdamW to the official optimizer sets?

Hi all,

I usually use the AdamW optimizer implemented by egg-west, since it is clearly effective when I train models. So I wonder why PyTorch doesn’t include AdamW or SGDR in its official set of optimizers. Is there a specific reason, such as unresolved issues with AdamW or SGDR in theory or in their implementation?
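
For context, the AdamW idea I mean is decoupling weight decay from the gradient-based update, as in the Loshchilov & Hutter paper. Here is a minimal sketch of that idea on top of the built-in Adam (not the egg-west implementation; the hyperparameter values are just illustrative):

```python
import torch

# Minimal sketch of decoupled weight decay ("AdamW") using plain Adam
# with weight_decay=0, so no L2 term enters the gradient or the moments.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)
wd = 1e-2  # decoupled weight decay coefficient (illustrative value)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
# Decay the weights directly, outside the Adam moment estimates.
with torch.no_grad():
    for group in optimizer.param_groups:
        for p in group["params"]:
            p.mul_(1 - group["lr"] * wd)
optimizer.step()
```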

Thanks,
Jinserk


https://www.fast.ai/2018/07/02/adam-weight-decay/

This article says that fast.ai is the only library that has implemented the fix.

I found a PR for AdamW.

(Edit)

I think this PR is better than the above one.

torch.optim.lr_scheduler.CosineAnnealingWarmRestarts is a scheduler for SGDR.
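
A quick usage sketch with SGD, in case it helps (the T_0/T_mult values are just illustrative):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# First restart after 10 epochs, then the period doubles each time (10, 20, 40, ...).
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

x, y = torch.randn(32, 10), torch.randn(32, 1)
for epoch in range(30):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # cosine-anneal the LR, resetting it at each warm restart
```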

Thanks @Tony-Y for the information! I didn’t know the PR existed, and I’m surprised that it has been open for more than a year waiting to be verified. I knew about the fast.ai solution but wanted to ask why official PyTorch doesn’t have one, since AdamW is quite effective. I’ll follow the PRs as well. Thank you again!