UPD: oops, in this particular case it was the initialization that was applied twice; but I am quite sure that I observed behaviour like the one I describe above before!
I am working on a problem with boundary conditions. It seems that with SGD, a method started from a feasible point stays feasible; with Adam, however, the first several steps make seemingly random jumps that violate the constraints, and only then does it slowly move back towards the boundaries (from the “wrong side”). I have observed similar behaviour multiple times on different problems - the first few iterations of Adam look quite large and random. I wasn’t able to find learning-rate settings that help.
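For what it's worth, this is consistent with how Adam's update rule behaves at step one: after bias correction, the first update is roughly `lr * sign(grad)` regardless of how small the gradient is, so even near a stationary feasible point the parameters move by about one full learning rate. A minimal single-step sketch (the function and values here are mine, not from any library):

```python
import math

def adam_first_step(grad, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Adam's moment estimates after the very first step (state starts at zero).
    m = (1 - b1) * grad           # first moment
    v = (1 - b2) * grad ** 2      # second moment
    # Bias correction at t = 1 divides by (1 - b1) and (1 - b2),
    # so m_hat = grad and v_hat = grad**2 exactly.
    m_hat = m / (1 - b1)
    v_hat = v / (1 - b2)
    # The step is lr * grad / (|grad| + eps) ~= lr * sign(grad):
    # its size barely depends on the gradient's magnitude.
    return lr * m_hat / (math.sqrt(v_hat) + eps)

for g in (1e-6, 1.0, 1e6):
    print(adam_first_step(g))   # all close to lr = 1e-3
```

SGD's first step is `lr * grad`, which is tiny for a tiny gradient, so a feasible near-stationary point is barely disturbed; Adam's normalized step is not, which may explain the initial jumps across the boundary.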