Adam optimizer doesn't converge while SGD works fine

I am training a seq2seq model using SGD and I get decent results. My batch size is 2, and I don’t average the loss over the number of steps.

I am using PyTorch this way:

optimizer = torch.optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)

SGD works fine, I can observe losses decreasing slowly, and the final accuracy is pretty good.

Naturally, I wanted to try the Adam optimizer to see whether I could get results any faster.

optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)

But Adam doesn’t work at all. None of the training/dev/test losses decrease (they only fluctuate with noise), and their values are far too large.

Do you have any idea what may cause such behaviour?

Have you played around with the learning rate of Adam?
I assume besides changing the optimizer the code stays the same, so that would be my first hyperparameter to tweak.
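A minimal sketch of what such a learning-rate sweep might look like. It is written in plain Python with a hand-written Adam so it runs without PyTorch. The toy loss `(w - 3)^2` and the names `train_with_adam` and `sweep` are made up for illustration; in the real setup `train_fn` would be a short training run of the seq2seq model that returns a dev loss.

```python
import math

def train_with_adam(lr, steps=200, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise the toy loss (w - 3)^2 with a hand-written Adam; return the final loss."""
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * (w - 3.0)                       # gradient of the toy loss
        m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
        v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)                 # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return (w - 3.0) ** 2

def sweep(train_fn, learning_rates):
    """Run train_fn once per learning rate and collect the final losses."""
    return {lr: train_fn(lr) for lr in learning_rates}

# Try learning rates spaced by factors of 10 (hypothetical grid).
results = sweep(train_with_adam, [1e-5, 1e-4, 1e-3, 1e-2, 1e-1])
for lr, loss in sorted(results.items()):
    print(f"lr={lr:.0e}  final loss={loss:.4f}")

best_lr = min(results, key=results.get)
print("lowest final loss at lr =", best_lr)
```

Spacing the candidates by factors of 10 keeps the sweep cheap; a finer grid around the best value can follow if needed.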

I think 0.001 is a pretty standard value, so it seems weird to me that it doesn’t work at all. And yes, I get the same result with 0.0001.

Also, I think the network is very sensitive to the SGD learning rate. It barely works if I change it at all.
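One possible reason the same `lr=1e-3` behaves so differently in the two optimizers: an SGD step is proportional to the gradient, while an Adam step is roughly `lr` in size whatever the gradient scale is. With a loss that is summed rather than averaged over steps, the gradients can be large, so the two step sizes may differ by orders of magnitude. The sketch below only illustrates that scaling for a single scalar parameter on the very first update. The helper names are made up, and the defaults mirror `torch.optim.Adam` (`betas=(0.9, 0.999)`, `eps=1e-8`). Whether this explains the behaviour in this thread is an assumption.

```python
import math

def sgd_step(grad, lr):
    """Plain SGD update for one scalar parameter: delta = -lr * grad."""
    return -lr * grad

def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update for one scalar parameter at step t=1, starting from zero moments."""
    m = (1 - beta1) * grad           # first-moment estimate
    v = (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1)          # bias correction for t=1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)

lr = 1e-3
for grad in (0.01, 1.0, 100.0):
    # SGD's update grows with the gradient; Adam's stays close to -lr.
    print(f"grad={grad:>6}: sgd={sgd_step(grad, lr):+.2e}  adam={adam_first_step(grad, lr):+.2e}")
```

So a learning rate that was tuned for SGD does not transfer to Adam, and the usual Adam default is not guaranteed to suit a model whose gradients are unusually large or small.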

Hello, I have the same problem… Did you find an explanation?



Well, eventually I was able to train an almost sensible neural net using Adam with a learning rate of 0.0001 or 0.00001, I don’t remember which. It was still clearly worse than SGD, so I abandoned it, but I was reassured that it’s probably possible, so maybe my network doesn’t have any bugs.

But I encourage you to try your own ideas and share insights! I think seq2seq networks have a lot of specific issues I couldn’t find on the internet.

@Dawid_S try changing the learning rate a bit.
I also have the same problem.
For a while, I trusted SGD more than Adam,
but that doesn’t make sense when others report that Adam performs better than SGD. It turns out that we have to pick the right learning rate to start with.
Source: my own bloody experience using Adam.
CMIIW. I would also like to know why this happens all the time.

best regards,
Albert Christianto

I also have the same problem. I implemented a ResNet+LSTM model. Funnily enough, Adam seems more unstable, or converges really slowly, while SGD takes only 6 epochs to converge.
But most reports claim that Adam should converge more quickly.