I’m currently using the LBFGS optimizer without line search and noticed that, on the first iteration, the initial step size is scaled relative to the gradient:
############################################################
# compute step length
############################################################
# reset initial guess for step size
if state['n_iter'] == 1:
    t = min(1., 1. / flat_grad.abs().sum()) * lr
else:
    t = lr
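For concreteness, here is a minimal sketch (not the optimizer itself, just the arithmetic of that line) showing what the scaling does on the first iteration. When the gradient is large, the step length shrinks to lr / ||g||_1, so the first update has an L1 norm of roughly lr; when the gradient already has small magnitude, the min(1., ...) cap leaves t = lr unchanged:

import torch

lr = 1.0
flat_grad = torch.tensor([50.0, -30.0, 20.0])   # |g|_1 = 100
t = min(1., 1. / flat_grad.abs().sum()) * lr
print(t)                              # ~0.01
print((t * flat_grad).abs().sum())    # ~1.0, i.e. roughly lr

So, if I read it correctly, this just keeps the very first step from being huge when the initial gradient is large, before any curvature information is available.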
Does anyone know the reasoning for this? Most implementations I’ve seen that don’t use a line search employ a fixed step size. I’ve also looked in the relevant chapters of [Jorge Nocedal, Stephen Wright: Numerical Optimization] and found no discussion of scaling the step size in this way.
Thanks!