Hi, I do not fully understand the problem either. However, here are some thoughts on it:
- Your loss already decreases without explicit learning rate decay. Is there a particular reason you want to get learning rate decay working?
- Adam uses adaptive learning rates intrinsically. I guess for many problems that should be good enough. You can read more on this in this discussion on Stack Overflow.
- Adam (like many other common optimization algorithms) adapts to a specific machine learning problem by estimating first and second moments of the gradients. Creating a new optimizer every epoch should therefore degrade performance, because that accumulated information is lost.
- I feel like decreasing the learning rate by 75 % might be too aggressive for a momentum-based optimizer. It would be interesting to see whether reducing it by something like 15–25 % gives better results (see the sketch after this list).
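If you still want an explicit schedule, you can decay the learning rate of a single, long-lived Adam instance instead of rebuilding the optimizer each epoch. Here is a minimal sketch assuming a Keras/TensorFlow setup; the model, the random placeholder data, and the 20 % per-epoch decay factor are just illustrative assumptions, not taken from your code:

```python
import numpy as np
import tensorflow as tf

# Placeholder data, only so the snippet runs end to end.
x_train = np.random.rand(256, 10).astype("float32")
y_train = np.random.rand(256, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])

# One Adam instance for the whole run, so its moment estimates survive
# across epochs.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(optimizer=optimizer, loss="mse")

def schedule(epoch, lr):
    # Reduce the learning rate by 20 % each epoch (factor 0.8),
    # rather than the 75 % cut discussed above.
    return lr * 0.8

model.fit(
    x_train, y_train,
    epochs=5,
    callbacks=[tf.keras.callbacks.LearningRateScheduler(schedule)],
    verbose=0,
)
```

The point is that the `LearningRateScheduler` callback only changes the step size of the existing optimizer, so you get the decay without throwing away Adam's internal state.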