Have you tried playing around with the hyperparameters, e.g. lowering the learning rate?
Is the loss constant, i.e. a single value or is it noisy around the initial value?
Thanks for the update!
In that case I would recommend trying to overfit a small data sample as a simple test.
If your model is not able to overfit this small data sample (e.g. 10 samples), you might have a bug in the code somewhere, which I might have missed.
Once you can overfit, I would try to scale up the experiment slowly and make sure your model is still learning.
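As a sketch of what this sanity check can look like (the model, data shapes, and hyperparameters below are made up for illustration, not taken from this thread):

```python
# Sanity check: a healthy training setup should be able to memorize a
# tiny dataset. Everything here is a placeholder for your real pipeline.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny synthetic "dataset": 10 samples, 3x64x64 images, 2 classes
x = torch.randn(10, 3, 64, 64)
y = torch.randint(0, 2, (10,))

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# If the pipeline is healthy, the model should reach 100% accuracy on
# these 10 samples and the loss should approach zero.
acc = (model(x).argmax(dim=1) == y).float().mean().item()
print(f"final loss={loss.item():.4f}, train acc={acc:.2f}")
```

If the loss refuses to approach zero even here, the bug is most likely in the training loop or model code rather than in the data or the hyperparameters.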
I am facing the same issue with my code. I am trying to train the AlexNet model provided by the torchvision library.
But my loss is not decreasing; it is fluctuating between 2.311 and 2.312.
I tried changing the learning rate (0.1, 0.05, 0.01, etc.), the batch size, and the number of epochs, but nothing is working.
Can you suggest some option? Can I send you my code? @ptrblck
Did you try to overfit a small data sample as suggested in the last post?
If not, I would highly recommend it, as it's an easy and fast way to make sure the general training code and model could work.
As you recommended, I'm trying on a small data sample (two classes), but I'm getting absolutely no movement (in validation, it guesses the same class every time, resulting in an accuracy of 0.5).
Even in the tutorial I linked, at the bottom you can see they examined training models from scratch, and even THEIR accuracy stayed completely stagnant from the beginning.
What am I missing? I suspect it has something to do with my optimizer.
Edit: Now I'm sure it's the optimizer. I found someone else trying to implement it the same way as the original AlexNet paper, and they too say it "doesn't train"; when I use the optimizer they used instead (Adam), it trains fine!
Could you help a beginner like me understand why the AlexNet paper's optimizer just doesn't train?
It's hard to tell what might be causing the training failure you are seeing, but I would guess it depends on the hyperparameters you are using as well as the overall training routine.
I would try to check the reference implementation (if there is one), as I'm sure there is code which reproduces the original claims of the paper (or comes close to them). AlexNet is probably one of the more important models, so it would be surprising if the paper weren't reproducible (unless the authors have a corrected version and the original paper doesn't mention it).
It's hard to tell what might be causing the training failure you are seeing, but I would guess it depends on the hyperparameters you are using as well as the overall training routine.
Can you elaborate on what might be the culprit in the training routine or the other hyperparameters, then? Simply changing the optimizer from torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005) to torch.optim.Adam(params=model.parameters(), lr=0.0001) seems to alleviate my problems in the small, two-class sample run (where there was zero improvement in accuracy, there is now much improvement!).
Additionally, I found the more or less "canonical" implementation. I see they mention using SGD in the wiki:
--mini 128 train using SGD with minibatch of 128 examples
but aside from that I cannot untangle where/how it's being done. I see the learning-rate handling in the code, but nothing related to SGD with a "momentum of 0.9" as specified in the paper and mentioned in passing in the wiki.
I think I/we are missing something nuanced about the optimizer, I'm just not sure what… any help would be appreciated (for now, though, I'll use Adam).
I'm not an expert when it comes to the nuances of different optimizers, but in the past I've seen that while SGD can yield a lower final loss, getting it to train/converge can be harder than with a more sophisticated optimizer. As a side note: when trying to debug some code, my default optimizer is always Adam, as it usually makes it easy to tell whether the code has a bug somewhere (e.g. an accidentally detached tensor) or whether the optimization itself is not working due to a bad hyperparameter set.
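To illustrate the kind of bug meant here (a minimal made-up sketch, not code from this thread): accidentally detaching a tensor cuts the autograd graph, and then no optimizer can reduce the loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
x = torch.randn(32, 4)
y = torch.randint(0, 2, (32,))

# BUG: .detach() cuts the autograd graph, so no gradients can flow back
# into the model, and the loss stays constant no matter the optimizer.
buggy_loss = criterion(model(x).detach(), y)

# Correct version: the graph is intact and backward() will update the model.
ok_loss = criterion(model(x), y)

print(buggy_loss.requires_grad, ok_loss.requires_grad)  # False True
```

A constant loss with `requires_grad=False` points to a code bug; a noisy but non-decreasing loss with an intact graph points to hyperparameters.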
With that being said, you might want to adapt the learning rate in SGD and see if this helps.
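A quick way to do that is a small learning-rate sweep. This sketch uses a made-up toy model and task, and arbitrary LR values; only the sweep structure is the point:

```python
import torch
import torch.nn as nn

x = torch.randn(64, 8)
y = (x[:, 0] > 0).long()  # toy, linearly separable 2-class task

results = {}
for lr in (1e-4, 1e-3, 1e-2, 1e-1):
    torch.manual_seed(0)  # same init for every run, for a fair comparison
    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(200):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    results[lr] = loss.item()
    print(f"lr={lr:.0e}  final loss={results[lr]:.3f}")
```

If no value in the sweep moves the loss at all, the problem is usually not the learning rate.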
Interesting about Adam being the go-to; I will remember that. No problem, and thanks for the help. One last thing: would you be able to direct me anywhere that can educate me on the "nuance" of these optimizers (beyond the basics; I've taken a look at the docs and the Adam paper but didn't pick up much practical knowledge for tackling the issue I'm seeing)?
With that being said, you might want to adapt the learning rate in SGD and see if this helps.
Also, I tried this, as well as other tweaks to the SGD optimizer, to no avail. I'm currently baffled.
Yeah, but also be careful about it and don't limit yourself: I'm usually not training an entire model end-to-end, just debugging some convergence issues, so Adam might not be the go-to anymore.
You should check the current literature on optimizers in ML and also refer to known-working repositories to see what they are using.
@rwightman would know a lot more about successfully training state of the art CV models.
The specific optimizer shouldn't determine whether training converges at all in most cases. It should be possible to use any of the common optimizers for this task; it's usually a matter of getting all the details right and searching over your hparams (if you aren't starting from known good defaults).
That said, AlexNet is difficult to train, and most adaptive optimizers tend to be more forgiving in challenging situations or with non-optimal hparams. So, the first question would be: why AlexNet? Using a net that has normalization layers (i.e. BatchNorm) and residual connections will make it significantly easier to train. And if it must be AlexNet, you probably need to dig through the original implementation and make sure your weight init is close to the original (which is likely not the case anymore with the PyTorch default), and you also need to use the correct batch size vs LR. I think the original was a batch size of 128 for .01, but it's not clear if that 128 factored in the 2x GPUs, so it might be 256 and .01 equivalent. Being off with SGD by even a small amount can mean instability with a net like this.
Thanks for all the tips about BatchNorm, weight inits, and the thought about the 2x GPU. I didn't realize SGD was so sensitive without these considerations.
Last question: is there a succinct explanation for why Adam optimizes AlexNet handily with many different hyperparameters (which I've tried) while SGD is stubborn? I've looked over the Adam paper and understand it estimates "exponential moving averages of the gradient and the squared gradient", but I don't understand how this takes care of an issue which, as you say, BatchNorm and other nuances about the batch size could solve.
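For reference, the update rules behind that quoted phrase from the Adam paper are (with $g_t$ the gradient and $\alpha$ the learning rate):

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
```

```latex
\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}
```

One hedged intuition (not a complete explanation): since $\hat m_t$ and $\sqrt{\hat v_t}$ have the same units as the gradient, their ratio is roughly of magnitude 1, so each parameter moves by at most about $\alpha$ per step regardless of how large or small its raw gradients are. With SGD the step is $\alpha\, g_t$, so in a network without normalization layers and with a sensitive init, layers whose gradients are badly scaled either barely move or blow up.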