NAdamW and Demon optimizers

Hey everyone,

While looking deeper into optimizers, I wanted to train my model using NAdamW (a mix of NAdam and AdamW), but I could not find any PyTorch implementations, only a Keras one. I'd appreciate any help implementing it.

As for Demon (Decaying Momentum), I did find a PyTorch implementation that looks promising, but I could not find any results or comparisons. Has anyone tried it before, or can you point me to another implementation?
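In case it helps the discussion: the core of Demon is just a schedule that decays the momentum coefficient to zero over training (the formula below is from the Demon paper by Chen et al.; how you wire it into an optimizer is up to you). A minimal sketch:

```python
def demon_beta(step: int, total_steps: int, beta_init: float = 0.9) -> float:
    """Demon momentum-decay schedule:
    beta_t = beta_init * (1 - t/T) / ((1 - beta_init) + beta_init * (1 - t/T)).
    Decays the momentum coefficient from beta_init at t=0 down to 0 at t=T."""
    frac = 1.0 - step / total_steps
    return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)

# With a PyTorch SGD-with-momentum optimizer you could (untested sketch) update
# the coefficient once per step:
# for group in optimizer.param_groups:
#     group["momentum"] = demon_beta(step, total_steps)
```

The same schedule can be applied to the first-moment beta of Adam-family optimizers, which is the "Demon Adam" variant from the paper.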

See `decoupled_weight_decay` in NAdam — PyTorch 2.2 documentation:

> decoupled_weight_decay (bool, optional) – whether to use decoupled weight decay as in AdamW to obtain NAdamW (default: False)