Dear all,

I tried to implement the algorithm from the paper "Optimistic Mirror Descent in Saddle-Point Problems: Going the Extra(-Gradient) Mile", built on top of the Adam optimizer.

This is the algorithmic structure (Algorithm 3 in the paper's appendix), and I am using a common learning rate for both steps.
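In case the algorithm screenshot does not render, the core extra-gradient idea can be sketched on a toy bilinear saddle point (plain Python; the toy objective and function names are mine, not from the paper). Plain simultaneous gradient descent-ascent spirals away from the saddle of f(x, y) = x * y, while taking the second step from the *original* point, using the gradient at the intermediate point, contracts to it:

```python
def extragradient(x, y, lr=0.1, steps=2000):
    """Extra-gradient on f(x, y) = x * y (min over x, max over y).

    Here df/dx = y and df/dy = x, so x descends with y and y ascends with x.
    """
    for _ in range(steps):
        # leading (intermediate) step, using the gradient at (x, y)
        x_half = x - lr * y
        y_half = y + lr * x
        # final step taken from the ORIGINAL point (x, y), but using the
        # gradient evaluated at the intermediate point (x_half, y_half)
        x, y = x - lr * y_half, y + lr * x_half
    return x, y

def descent_ascent(x, y, lr=0.1, steps=2000):
    """Plain simultaneous gradient descent-ascent on the same objective."""
    for _ in range(steps):
        x, y = x - lr * y, y + lr * x
    return x, y

# descent_ascent(1.0, 1.0) spirals outwards; extragradient(1.0, 1.0)
# converges to the saddle at (0, 0)
x, y = extragradient(1.0, 1.0)
```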

Could PyTorch experts please give me some feedback on my code (I am a newbie with PyTorch)? This is the code:

```python
import torch
from torch.optim import Optimizer


class OptimisticAdam(Optimizer):
    """Implements the Optimistic Adam algorithm. Built on the official
    pytorch implementation of Adam.

    See "Optimistic Mirror Descent in Saddle-Point Problems: Going the
    Extra(-Gradient) Mile", double blind review,
    paper: https://openreview.net/pdf?id=Bkg8jjC9KQ

    Adam itself has been proposed in `Adam: A Method for Stochastic
    Optimization`_.

    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 1e-3)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of gradient and its square (default: (0.9, 0.999))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-8)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        amsgrad (bool, optional): whether to use the AMSGrad variant of this
            algorithm from the paper `On the Convergence of Adam and Beyond`_
            (default: False)

    .. _Adam\: A Method for Stochastic Optimization:
        https://arxiv.org/abs/1412.6980
    .. _On the Convergence of Adam and Beyond:
        https://openreview.net/forum?id=ryQu7f-RZ
    """

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0, amsgrad=False):
        if not 0.0 <= lr:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if not 0.0 <= eps:
            raise ValueError("Invalid epsilon value: {}".format(eps))
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, amsgrad=amsgrad)
        super(OptimisticAdam, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(OptimisticAdam, self).__setstate__(state)
        for group in self.param_groups:
            group.setdefault('amsgrad', False)

    def step(self, closure=None):
        """Performs a single optimization step.

        Arguments:
            closure (callable): A closure that reevaluates the model and
                returns the loss. It is mandatory here, because it is called
                to obtain the gradient at the intermediate (extra-gradient)
                point.
        """
        # Do not allow training without a closure
        if closure is None:
            raise ValueError("This algorithm requires a closure definition "
                             "for the evaluation of the intermediate gradient")

        loss = None

        # ############### First gradient step: move to the intermediate point ##########
        # ###############################################################################
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data

                state = self.state[p]

                # @@@@@@@@@@@@@@@ State initialization @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg_1'] = torch.zeros_like(p.data)
                    state['exp_avg_2'] = torch.zeros_like(p.data)
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq_1'] = torch.zeros_like(p.data)
                    state['exp_avg_sq_2'] = torch.zeros_like(p.data)
                    # Maintains max of all exp. moving avg. of sq. grad. values
                    state['max_exp_avg_sq_1'] = torch.zeros_like(p.data)
                    state['max_exp_avg_sq_2'] = torch.zeros_like(p.data)
                # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

                # Keep a copy of the current parameters: the second update must
                # start from this point, not from the intermediate one. (A
                # shallow copy of param_groups would NOT do, since it still
                # refers to the same parameter tensors.)
                state['old_p'] = p.data.clone()

                exp_avg1, exp_avg_sq1 = state['exp_avg_1'], state['exp_avg_sq_1']
                max_exp_avg_sq1 = state['max_exp_avg_sq_1']
                beta1, beta2 = group['betas']

                # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
                # Step will be updated once
                state['step'] += 1
                # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

                if group['weight_decay'] != 0:
                    grad = grad.add(p.data, alpha=group['weight_decay'])

                # Decay the first and second moment running average coefficient
                exp_avg1.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq1.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']

                # *****************************************************
                # Additional steps, to get bias corrected running means
                # (out of place, so the state buffers stay uncorrected)
                exp_avg1 = torch.div(exp_avg1, bias_correction1)
                exp_avg_sq1 = torch.div(exp_avg_sq1, bias_correction2)
                # *****************************************************

                if group['amsgrad']:
                    # Maintains the maximum of all 2nd moment running avg. till now
                    torch.max(max_exp_avg_sq1, exp_avg_sq1, out=max_exp_avg_sq1)
                    # Use the max. for normalizing running avg. of gradient
                    denom = max_exp_avg_sq1.sqrt().add_(group['eps'])
                else:
                    denom = exp_avg_sq1.sqrt().add_(group['eps'])

                # The running means are already bias corrected above, so no
                # extra correction factor enters the step size
                step_size1 = group['lr']

                # Move to the intermediate point - WAITING STATE
                p.data.addcdiv_(exp_avg1, denom, value=-step_size1)

        # Perform an additional forward/backward pass to calculate the
        # stochastic gradient at the intermediate point
        loss = closure()

        # ############### Second gradient step: update from the original point #########
        # ###############################################################################
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data

                state = self.state[p]

                exp_avg2, exp_avg_sq2 = state['exp_avg_2'], state['exp_avg_sq_2']
                max_exp_avg_sq2 = state['max_exp_avg_sq_2']
                beta1, beta2 = group['betas']

                if group['weight_decay'] != 0:
                    grad = grad.add(p.data, alpha=group['weight_decay'])

                # Decay the first and second moment running average coefficient
                exp_avg2.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq2.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']

                # *****************************************************
                # Additional steps, to get bias corrected running means
                exp_avg2 = torch.div(exp_avg2, bias_correction1)
                exp_avg_sq2 = torch.div(exp_avg_sq2, bias_correction2)
                # *****************************************************

                if group['amsgrad']:
                    # Maintains the maximum of all 2nd moment running avg. till now
                    torch.max(max_exp_avg_sq2, exp_avg_sq2, out=max_exp_avg_sq2)
                    # Use the max. for normalizing running avg. of gradient
                    denom = max_exp_avg_sq2.sqrt().add_(group['eps'])
                else:
                    denom = exp_avg_sq2.sqrt().add_(group['eps'])

                step_size2 = group['lr']

                # Restore the parameters saved before the intermediate step ...
                p.data.copy_(state['old_p'])
                # ... and take the second step from that point, using the
                # gradient evaluated at the intermediate point
                p.data.addcdiv_(exp_avg2, denom, value=-step_size2)

        return loss
```
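For reference, this is how a `step()` that requires a closure is driven in a training loop. In this sketch `torch.optim.SGD` stands in so the snippet runs on its own; with the optimizer above you would only swap the constructor line (e.g. `OptimisticAdam(model.parameters(), lr=1e-3, betas=(0.0, 0.9))` — the model, data, and learning rate here are made-up toy values):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(2, 1)
inputs = torch.randn(16, 2)
targets = torch.randn(16, 1)

# SGD is a stand-in so this snippet is self-contained; the closure-based
# call pattern is the same for the optimistic optimizer above
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def closure():
    # Re-evaluates the loss and refreshes the gradients; the optimistic
    # optimizer calls this to get the gradient at the intermediate point
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    return loss

initial = closure().item()  # populate the gradients once before stepping
for _ in range(20):
    loss = opt.step(closure)
```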

Thank you very much for your time.

Hi Foivos,
I'm by no means an expert, but since you posted this over 6 months ago I was wondering whether you have used it and whether it seemed correct?

I am not an expert either. Yes, I am using it in my GAN tests. Currently, all of my tests are based on point clouds (astronomy research on galactic dynamics), and it works much better than standard WGAN-GP. With this, I can use a 1-1 training schedule and get faster convergence (WGAN-GP with a 1-1 ratio of G/D training, OMD-Adam with b1=0, b2=0.9).

Now, "works much better" means I get faster convergence to something that looks OK (for point clouds). But if I want to recover the real distribution I have to let it train for a long time (12-24h), and many more tests are needed. E.g. I've noticed that it is very easy for GANs to overfit outliers. Also, I am not interested in the points being spot on to the sample, but in them representing the underlying distribution. I do not know to what extent the optimizer is responsible for this. I still have some way to go with this, because it is not part of my standard duties at work.

E.g. this "looks" good:

if you focus on the middle plot; but if you focus on the left one, I see that the fake and the blue points do not have the same dispersion - so some kind of overfitting, I guess…

My results are not conclusive. I can only say it speeds up convergence, and currently is my standard method for GAN tests.

All the best,
Foivos


Looks quite promising. I'll check it out - and thanks for answering this quickly!

Hi @Foivos_Diakogiannis, quick question. In the step function you indicate that the closure parameter is optional, yet when it is None you throw an exception. I'm not quite sure what to use for the closure function. Could you explain this a little bit?

EDIT: I also couldn't find any mention of a closure function in the original paper, but they might have called it differently.

I created a github repo for this: https://github.com/feevos/pytorch_stuff/blob/master/pytorch_stuff/Notebooks/OMDAdam_WGAN-GP_Gaussians.ipynb

The basic idea, from what I recall from the paper, is that you need to perform the gradient calculation twice per step. So I tried to implement Algorithm 3 (page 20 in the appendix), where I used the `closure` function to implement the additional gradient step. So I am calculating the loss twice, sort of.
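To make the "gradient twice" mechanics concrete, here is a torch-free toy sketch of the closure pattern (the 1-D objective, names, and numbers are mine, not from the paper). The parameter is a dict mimicking a tensor with a grad slot; the closure both returns the loss and refreshes the gradient, which is exactly what the optimizer needs at the intermediate point:

```python
def make_closure(param):
    """Closure for minimizing loss(v) = (v - 3)^2, with grad = 2 * (v - 3)."""
    def closure():
        param['grad'] = 2.0 * (param['value'] - 3.0)  # refresh the gradient
        return (param['value'] - 3.0) ** 2            # and return the loss
    return closure

def extragrad_step(param, closure, lr=0.1):
    closure()                                    # gradient at the current point
    old = param['value']
    param['value'] = old - lr * param['grad']    # move to the intermediate point
    closure()                                    # gradient at the intermediate point
    param['value'] = old - lr * param['grad']    # final step from the ORIGINAL point
    return closure()                             # loss after the update

p = {'value': 0.0, 'grad': 0.0}
c = make_closure(p)
for _ in range(100):
    extragrad_step(p, c)
# p['value'] converges to the minimizer at 3.0
```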

Perhaps my code has errors, please do check and let me know if you disagree/find anything. But it works well in my tests. Please, if you play with the notebook, let me know what you think of the performance etc. Any comments most appreciated.

I hope youâ€™ll find this useful,
Regards,
Foivos