Hi, I’m working on the binarized neural network proposed here (binarynet) using PyTorch. I saw there’s already a version available on GitHub (BinarynetPytorch), which uses Adam as the optimizer. I tried switching to the SGD optimizer, but the network then did not train at all.
I changed the initial learning rate for Adam (i.e., 5e-3) to 5e-1 for SGD. I also played with various momentum values and batch sizes. Unfortunately, none of them worked. The loss just fluctuated around one value and never dropped.
Is there anything I should do or check?
I changed the initial learning rate for Adam (i.e., 5e-3) to 5e-1 for SGD
This always depends on the experiment, but 5e-1 seems really high for a learning rate. Have you tried with smaller values?
Yes, I’ve been playing with various ranges of learning rate. However, none of them worked for SGD. I wonder if the way the binarized neural net is implemented does not support SGD?
Did you ever find the solution to the problem of SGD not working when Adam does, even after experimenting with the variables?
One aspect of Adam vs SGD is that the former “normalizes” the pointwise size of gradients, i.e. if you scale gradients by a fixed number, it will (up to those epsilon regularizations) not change Adam’s behaviour, while the steps will be scaled by that same number if you use SGD with the same learning rate as before.
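To make the scaling point concrete, here is a minimal pure-Python sketch (not the PyTorch implementation; the function names and the made-up gradient sequence are only for illustration). It applies the textbook Adam and plain SGD update rules to one scalar parameter, once with the gradients as they are and once with every gradient multiplied by 1000.

```python
import math

def sgd_steps(w, grads, lr):
    """Plain SGD without momentum: w <- w - lr * g."""
    for g in grads:
        w = w - lr * g
    return w

def adam_steps(w, grads, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Textbook Adam update with bias correction, for one scalar parameter."""
    m = 0.0  # running mean of gradients
    v = 0.0  # running mean of squared gradients
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Hypothetical gradient sequence, and the same sequence scaled by 1000.
grads = [0.5, -0.2, 0.3]
scale = 1000.0
scaled_grads = [scale * g for g in grads]

lr = 1e-3
adam_plain = adam_steps(0.0, grads, lr)
adam_scaled = adam_steps(0.0, scaled_grads, lr)
sgd_plain = sgd_steps(0.0, grads, lr)
sgd_scaled = sgd_steps(0.0, scaled_grads, lr)

# Adam ends up in (almost) the same place; SGD moves 1000 times further.
print("Adam:", adam_plain, adam_scaled)
print("SGD: ", sgd_plain, sgd_scaled)
```

The two Adam results differ only through the small `eps` term, while the SGD displacement grows by the scale factor. This is one reason a learning rate that works for Adam does not carry over to SGD in any simple way.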
Now if you have different scales of gradients (this happens easily, e.g. for weights vs biases; we have an example in ch. 5 of our book even for linear regression), this can impede convergence for SGD while Adam does fine. Some models (e.g. StyleGAN, to pick a model I reimplemented and remember this about) explicitly address this to work well with SGD.
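As a toy illustration of mismatched gradient scales (not the binarized network itself, and with made-up numbers), the sketch below minimizes 0.5 * (50 * w1^2 + 0.01 * w2^2) in pure Python. The gradient for w1 is 5000 times larger than the one for w2 at equal parameter values. The learning rate of 0.01 is an assumption chosen so that SGD stays stable in the steep direction.

```python
import math

CURVATURES = [50.0, 0.01]  # hypothetical: one steep and one flat direction

def gradient(w):
    """Gradient of 0.5 * sum(a_i * w_i^2) is a_i * w_i for each parameter."""
    return [a * wi for a, wi in zip(CURVATURES, w)]

def run_sgd(w, lr, steps):
    w = list(w)
    for _ in range(steps):
        g = gradient(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def run_adam(w, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    w = list(w)
    m = [0.0] * len(w)
    v = [0.0] * len(w)
    for t in range(1, steps + 1):
        g = gradient(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            # Per-parameter normalization: the step size does not depend
            # on the overall scale of this parameter's gradient.
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

start = [1.0, 1.0]
sgd_w = run_sgd(start, lr=0.01, steps=300)
adam_w = run_adam(start, lr=0.01, steps=300)

# SGD solves the steep direction quickly but barely moves w2;
# Adam makes similar progress on both parameters.
print("SGD :", sgd_w)
print("Adam:", adam_w)
```

Raising the SGD learning rate enough to move w2 would make the w1 update diverge, which is the kind of trade-off that rescaling inputs or using per-parameter learning rates is meant to avoid.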