Have you played around with the learning rate of Adam?
I assume that, apart from swapping the optimizer, the code stays the same, so the learning rate would be the first hyperparameter I'd tweak.
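Something along these lines, just a sketch: the toy model and random data below stand in for your real setup, and only the optimizer line should need to change between runs.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real model and data.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,))),
    batch_size=32, shuffle=True,
)

# Working SGD baseline:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam typically needs a much smaller learning rate than SGD; try 1e-3, 1e-4, 1e-5.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(10):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```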
Well, eventually I was able to train an almost sensible neural net using Adam with a learning rate of 0.0001 or 0.00001, I don't remember which. It was still clearly worse than SGD, so I abandoned it, but it reassured me that it's probably possible and that my network probably doesn't have any bugs.
@Dawid_S try to change the learning rate a bit.
I have the same problem.
For a while now I've trusted SGD more than Adam, but that doesn't square with others reporting that Adam performs better than SGD. It turns out you have to set the right learning rate from the start.
Source: my own painful experience with Adam.
CMIIW. I'd also like to know why this issue keeps happening.
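To make the learning-rate point concrete, here is a rough sketch of the kind of sweep I mean. The toy model and random data are placeholders for your own setup, and the absolute loss values don't mean anything, only the comparison across settings does.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_model():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

criterion = nn.CrossEntropyLoss()
loader = DataLoader(
    TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,))),
    batch_size=32, shuffle=True,
)

def final_loss(optimizer_cls, lr, epochs=5):
    """Train a fresh copy of the toy model and return the last epoch's mean loss."""
    model = make_model()
    opt = optimizer_cls(model.parameters(), lr=lr)
    for _ in range(epochs):
        total, batches = 0.0, 0
        for x, y in loader:
            opt.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            opt.step()
            total += loss.item()
            batches += 1
    return total / batches

# Sweep Adam over a few learning rates, then compare with an SGD baseline.
for lr in (1e-2, 1e-3, 1e-4, 1e-5):
    print(f"Adam lr={lr:g}: final loss {final_loss(torch.optim.Adam, lr):.4f}")
print(f"SGD  lr=0.01: final loss {final_loss(torch.optim.SGD, 1e-2):.4f}")
```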
I have the same problem. I implemented a ResNet+LSTM model, and oddly enough, Adam seems more unstable, or converges really slowly, while SGD takes only 6 epochs to converge.
Yet most reports claim that Adam should converge more quickly.