Hey, I’ve been working on a small side project and wanted to share it and get some thoughts from people who know this space better than I do.
GYRO (Geometric Yield Rotation Optimizer) is a PyTorch optimizer that wraps Adam with a single extra step: before updating the momentum buffers, it checks whether the current gradient and the accumulated momentum are pointing in opposing directions. If they are, it removes the oscillating component and rescales to preserve the gradient norm.
The motivation is the narrow ravine problem — when gradients oscillate between steep walls while making slow progress along the valley axis. The fix is simple: detect the oscillation via cosine similarity, project it out, move on.
It adds no extra optimizer state beyond what Adam already stores, so memory overhead is zero. Time overhead is one dot product and two norms per parameter tensor per step.
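For readers who want the geometry spelled out, here is a minimal plain-Python sketch of the check described above. It is not the code from the repo: the function name, `cos_threshold`, `fraction`, and the handling of the exactly anti-parallel case are all assumptions made for illustration, and it works on flat lists of floats rather than tensors.

```python
import math

def damp_opposing_component(grad, momentum, cos_threshold=0.0, fraction=1.0, eps=1e-12):
    """Reduce the part of `grad` that opposes `momentum`, keeping the norm of `grad`.

    `cos_threshold` and `fraction` are hypothetical names for this sketch only.
    """
    dot = sum(g * m for g, m in zip(grad, momentum))
    g_norm = math.sqrt(sum(g * g for g in grad))
    m_norm = math.sqrt(sum(m * m for m in momentum))
    if g_norm < eps or m_norm < eps:
        return list(grad)  # nothing to compare against yet

    cos = dot / (g_norm * m_norm)
    if cos >= -cos_threshold:
        return list(grad)  # not opposing strongly enough: leave the gradient alone

    # Subtract `fraction` of the projection of grad onto momentum.
    coeff = fraction * dot / (m_norm * m_norm)
    projected = [g - coeff * m for g, m in zip(grad, momentum)]

    # Rescale so the result has the same norm as the original gradient.
    # (Computed directly here for readability; this is an extra norm.)
    p_norm = math.sqrt(sum(p * p for p in projected))
    if p_norm < eps:
        return list(grad)  # assumption: if nothing is left, fall back to grad
    scale = g_norm / p_norm
    return [p * scale for p in projected]

# Gradient partly opposing the momentum direction:
print(damp_opposing_component([1.0, -1.0], [0.0, 1.0]))
```

With `fraction=1.0` the returned vector is orthogonal to the momentum and has the same length as the input gradient; smaller values leave part of the opposing component in place.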
Results are modest and I want to be upfront about that. On short runs GYRO is within noise of Adam and AdamW. On 15-epoch CIFAR-10 it shows a consistent ~1% edge in best accuracy and lower training loss, which I think is real but not dramatic. On a small transformer benchmark AdamW has a slight edge. The synthetic ravine benchmark (f(x) = 100x₀² + x₁²) shows SGD failing to converge while GYRO reaches the minimum cleanly, which at least confirms the geometry is working as intended.
It has two tunable parameters beyond standard Adam: theta_base (how strong an oscillation needs to be before correction triggers) and proj_factor (how much of the oscillating component to remove — 1.0 fully removes it, 0.5 removes half).
from gyro import GYROAdam
optimizer = GYROAdam(model.parameters(), lr=1e-3)
Repo: https://github.com/sunderflowres-stack/gyro_optimizer — Apache 2.0, pip installable.
Curious whether the momentum-buffer comparison approach makes sense to people, and whether there are obvious failure modes I haven’t tested yet. Happy to be told this is equivalent to something that already exists; critique is welcome too.
I’m working on gradient projections and multiobjective optimization, and had a similar idea in this paper (https://arxiv.org/pdf/2406.16232, Section 3, "Momentum-based optimization"). But I never experimented enough with it and dropped the idea due to lack of time. Early experiments were positive though, so I genuinely think that this idea has potential, especially if you push it a lot further!
It’s really cool to see that you actually implemented and evaluated that. And the repo looks clean.
The synthetic ravine benchmark (f(x) = 100x₀² + x₁²) shows SGD failing to converge
Isn’t SGD (or even GD) theoretically guaranteed to converge in this setting, given an appropriate learning rate? It is smooth and (strongly) convex.
I also have a suggestion: it would be very interesting to plot the evolution over the training steps of the cosine similarity between g_t and m_t, and also the evolution of the norm of g_t and the norm of m_t over the training steps. Or at least something that gives some information about how often your projection actually happens in practice.
Thanks for sharing!
Hi Valerian, thank you for the kind words and for linking your paper! It’s great to see Section 3; it validates that this geometric intuition is worth exploring. I really appreciate your feedback.
Regarding SGD on the synthetic ravine: you are mathematically correct. The function is strongly convex, so SGD is guaranteed to converge given a properly tuned learning rate. In my benchmark, I used a fixed LR. Because the condition number is so large (100:1), an LR that is safe for x_0 makes progress along x_1 painfully slow, so SGD fails to reach the minimum within the allotted 3000 steps, whereas adaptive methods handle the scaling automatically. I will clarify this in the README so I don’t make an inaccurate theoretical claim.
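To make the fixed-LR argument concrete: on f(x) = 100x_0² + x_1² the gradient is (200x_0, 2x_1), so each plain gradient-descent step multiplies x_0 by (1 - 200·lr) and x_1 by (1 - 2·lr). The x_0 coordinate only stays stable for lr < 0.01, and at such rates x_1 shrinks by a factor close to 1 per step. The sketch below just evaluates this; the learning rates and the starting point (1, 1) are assumptions chosen for illustration, not the values used in the benchmark.

```python
def gradient_descent_ravine(lr, steps, x0=1.0, x1=1.0):
    """Plain gradient descent on f(x) = 100*x0**2 + x1**2 with a fixed learning rate."""
    for _ in range(steps):
        # grad f = (200*x0, 2*x1)
        x0 -= lr * 200.0 * x0
        x1 -= lr * 2.0 * x1
    return x0, x1

def contraction_factors(lr):
    """Per-step multipliers (in absolute value) for x0 and x1."""
    return abs(1.0 - 200.0 * lr), abs(1.0 - 2.0 * lr)

# Illustrative learning rates only; the thread does not state the one used.
for lr in (1e-4, 1e-3, 9e-3, 2e-2):
    f0, f1 = contraction_factors(lr)
    x0, x1 = gradient_descent_ravine(lr, steps=3000)
    print(f"lr={lr:g}  factor_x0={f0:.4f}  factor_x1={f1:.4f}  final=({x0:.3e}, {x1:.3e})")
```

How far x_1 gets in 3000 steps depends heavily on the chosen rate, which is why the README wording should name the LR rather than say SGD "fails to converge".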
Your suggestion about logging the cosine similarity, g_t norm, and m_t norm over time is spot on. Several people have asked for exactly this to see how often the projection triggers. I am working on adding telemetry to the optimizer right now to plot these exact metrics.
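One possible shape for that telemetry, as a rough plain-Python sketch rather than what will ship in the repo: a small recorder that stores the cosine similarity, both norms, and whether the trigger condition held at each step. The class name, its methods, and `cos_threshold` are hypothetical, and vectors are flat lists of floats.

```python
import math

class ProjectionTelemetry:
    """Hypothetical per-step recorder for cos(g_t, m_t), |g_t|, |m_t| and trigger events."""

    def __init__(self, cos_threshold=0.0):
        self.cos_threshold = cos_threshold  # assumed trigger rule: cos < -cos_threshold
        self.cosines = []
        self.grad_norms = []
        self.momentum_norms = []
        self.triggered = []

    def record(self, grad, momentum, eps=1e-12):
        dot = sum(g * m for g, m in zip(grad, momentum))
        g_norm = math.sqrt(sum(g * g for g in grad))
        m_norm = math.sqrt(sum(m * m for m in momentum))
        # Treat the cosine as 0 when either vector is (numerically) zero.
        cos = dot / (g_norm * m_norm) if g_norm > eps and m_norm > eps else 0.0
        self.cosines.append(cos)
        self.grad_norms.append(g_norm)
        self.momentum_norms.append(m_norm)
        self.triggered.append(cos < -self.cos_threshold)

    def trigger_rate(self):
        """Fraction of recorded steps on which the projection would have fired."""
        return sum(self.triggered) / len(self.triggered) if self.triggered else 0.0

telemetry = ProjectionTelemetry()
telemetry.record([1.0, 0.0], [1.0, 0.0])   # aligned
telemetry.record([1.0, 0.0], [-1.0, 0.0])  # opposing
print(telemetry.trigger_rate())
```

The three lists can be plotted directly against the step index, and the trigger rate gives a single number for how often the correction is active.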
Thanks again for the encouragement. If you ever find the time to revisit your experiments, I’d love to hear your thoughts on how GYRO behaves in your multiobjective setups.