In the oldest form of gradient descent, where the batch is the whole fixed sample, should the loss be able to oscillate over epochs at all? Assume the following algorithm:
for epoch in epochs
I think the loss should just decrease monotonically given a fixed sample. If the sample changed randomly between epochs, then oscillation would be possible. Does Adam have anything to do with this, since Adam isn't the oldest form of gradient descent?
Generally speaking, the loss can oscillate even in a convex optimisation problem. For instance, this can happen when you are close to the global optimum but your fixed learning rate is large enough that you overshoot it and end up in a region of higher loss. To avoid that, you can use more sophisticated step-size rules such as backtracking line search or the Wolfe conditions (both have Wikipedia articles under those names), which make sure that you don't overshoot. These give you some guarantees that the loss decreases monotonically as you converge.
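To make the overshoot concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration, not something from the question: the convex loss `sqrt(x^2 + EPS)`, the starting point, the learning rate of 0.5, and the backtracking parameters `t0`, `beta` and `c`. It runs deterministic gradient descent twice on the same function, once with a fixed step and once with a simple Armijo-style backtracking line search.

```python
import math

EPS = 0.01  # smoothing constant; an arbitrary choice for this sketch


def loss(x):
    # Convex, smooth approximation of |x| with its minimum at x = 0.
    return math.sqrt(x * x + EPS)


def grad(x):
    # Derivative of loss(x).
    return x / math.sqrt(x * x + EPS)


def gd_fixed(x, lr, steps):
    # Gradient descent with a fixed learning rate; no randomness anywhere.
    losses = [loss(x)]
    for _ in range(steps):
        x -= lr * grad(x)
        losses.append(loss(x))
    return losses


def gd_backtracking(x, steps, t0=0.5, beta=0.5, c=1e-4):
    # Gradient descent where each step size is shrunk until the
    # Armijo sufficient-decrease condition holds.
    losses = [loss(x)]
    for _ in range(steps):
        g = grad(x)
        t = t0
        while loss(x - t * g) > loss(x) - c * t * g * g:
            t *= beta
        x -= t * g
        losses.append(loss(x))
    return losses


fixed = gd_fixed(2.0, lr=0.5, steps=30)
bt = gd_backtracking(2.0, steps=30)

# True if the loss went up at least once between consecutive steps.
print(any(b > a for a, b in zip(fixed, fixed[1:])))
# True if the loss never went up between consecutive steps.
print(all(b <= a for a, b in zip(bt, bt[1:])))
```

With the fixed step, the iterate approaches the minimum steadily and then jumps across it, so the loss rises on some steps even though the data and the objective never change. With backtracking, a step is only accepted if it lowers the loss, so the recorded losses never increase.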