I am working on a multi-class classification task (using cross-entropy loss) and I am facing an issue when using the Adam optimizer and mixed precision together.
After a few hours of training, the loss starts to go to NaN. I saw here that some people faced the same issue and advised increasing the eps term of Adam, so that it is not rounded to 0 in float16, by setting it to 1e-4 when working in half precision. (Some people there used a lower eps=1e-7, but others argued it may not solve the problem.)
So I tried this suggestion, but with eps=1e-4 my loss hardly decreases over time, while it was decreasing quite fast with the default eps=1e-8 (until it went to NaN); the same behavior seems to have been observed by other people, as in this post.
Is it possible to avoid those NaNs without increasing the eps term of Adam too much?
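For context, the underflow mentioned above is easy to check: the smallest positive (subnormal) float16 value is 2**-24 ≈ 5.96e-8, so the default eps=1e-8 literally becomes 0 in half precision, while 1e-7 and 1e-4 survive the cast. A quick demonstration with NumPy:

```python
import numpy as np

# The smallest positive (subnormal) float16 value is 2**-24 ≈ 5.96e-8,
# so Adam's default eps=1e-8 underflows to exactly 0 in half precision,
# while 1e-7 and 1e-4 remain representable.
for eps in (1e-8, 1e-7, 1e-4):
    print(eps, "->", np.float16(eps))
```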
So you are using AMP with gradient scaling?
Gradient hooks will slow things down a bit, but maybe not terribly.
One quick thing you could try is checking whether the problem happens with a particular (group of) parameter(s) and using a larger eps only for those (via parameter groups).
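To illustrate the parameter-group idea, here is a minimal sketch (the toy model and the choice of "suspect" layer are placeholders, not your actual network):

```python
import torch
import torch.nn as nn

# Toy model standing in for the real network; the layer split is arbitrary.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 4))

# Suppose the NaNs are traced to the last layer's parameters: give only
# that group a larger eps, and keep the default for everything else.
suspect = list(model[2].parameters())
suspect_ids = {id(p) for p in suspect}
others = [p for p in model.parameters() if id(p) not in suspect_ids]

optimizer = torch.optim.Adam([
    {"params": others},                # default eps=1e-8
    {"params": suspect, "eps": 1e-4},  # larger eps only for this group
], lr=1e-3)
```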
Thanks for your answer. Yes, I use AMP with gradient scaling, following this recipe.
By the way, since the optimizer update happens outside the amp autocast context, is the small epsilon really problematic with respect to precision?
Thanks for the suggestion to use parameter groups; it may be a good idea and I will probably investigate this solution.
For now, I am running a few experiments with AMP disabled and the batch size halved, but I will keep you informed if I switch back to AMP.
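For reference, the usual AMP + GradScaler recipe boils down to a loop like the following sketch (placeholder model and dummy batch; written device-agnostically, with the scaler disabled on CPU, so it is not the exact CUDA-only recipe):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(10, 4).to(device)   # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
# GradScaler is a no-op when enabled=False, so the same loop runs
# with or without mixed precision.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(8, 10, device=device)       # dummy batch
y = torch.randint(0, 4, (8,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, enabled=(device == "cuda")):
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()  # scale loss so fp16 grads don't underflow
scaler.step(optimizer)         # unscales grads; skips the step on inf/NaN
scaler.update()                # adjusts the scale factor for next iteration
```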
Let’s say I have limited insight into what exactly is happening. (For this it would be interesting to see 1) the values of the parameter and optimizer state just before they turn NaN, and 2) the value of the gradient that causes the state to turn NaN.)
So the epsilon helps when v_hat is 0, i.e. when v is 0.
This seems to tell me that the loss gradient is terribly small all along.
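To make the role of eps concrete, here is a tiny numerical sketch of one Adam update direction for a single weight (assuming a steady gradient, so m_hat ≈ g and v_hat ≈ g², which is of course a simplification):

```python
import math

# One Adam update direction for a single weight: m_hat / (sqrt(v_hat) + eps).
# With a steady tiny gradient g, m_hat ≈ g and v_hat ≈ g**2, so the
# denominator is ≈ |g| + eps: eps=1e-8 barely matters, while eps=1e-4
# dominates and shrinks the step by orders of magnitude.
def adam_step_size(g, eps, lr=1e-3):
    m_hat, v_hat = g, g * g
    return lr * m_hat / (math.sqrt(v_hat) + eps)

g = 1e-6  # "terribly small" gradient
print(adam_step_size(g, eps=1e-8))  # close to the full lr
print(adam_step_size(g, eps=1e-4))  # roughly 100x smaller step
```

This also explains both symptoms at once: with eps=1e-8 a v_hat that underflows to 0 leaves an almost-zero denominator (hence the blow-up), while eps=1e-4 caps the step so hard that tiny gradients barely move the weights.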
One other thing you might try is to switch from Adam, which adapts at the per-coefficient level, to an optimizer that works at the tensor level, like LAMB (there are more, and probably one closer to Adam in spirit, but this is the acronym I can remember).
Ok, the LAMB and LARS optimizers seem interesting; I'm not familiar with them. I'll definitely give them a try.
Just for information, since the last message:
I tried training with Adam and eps=1e-6 + AMP: the training starts well, decreasing at a good rate, until it goes to NaN after some time.
I tried training with Adam and the default eps=1e-8 without AMP: the training is fine.
I am going to look at the parameter values and optimizer states just before the training collapses soon (I need to finish some runs first).
Just a question: why do you advise me to try the LAMB optimizer? Is it because you think the trust ratio may help upscale the update in the layers where the loss gradient is very small?
Oh, also: the batch size I use is relatively small, as I'm limited by my GPU memory and I work with quite heavy input data and a heavy model. This may not be ideal for LAMB or LARS, as they seem to work better on huge batches. To give a rough idea:
- With Adam optim without AMP, the max batch size I can use is only 3.
- With Adam optim with AMP, the max batch size I can use is around 5.
- With SGD or RMSprop with AMP, the max batch size I can use is around 16. RMSprop converges much more slowly and with significantly lower accuracy than Adam in my case. I have not tried SGD yet, but I plan to.
Still a lot of things to try and test; I'll give an update if I find something interesting.
First, I have to say you know a lot more about your situation and training than I do, so take all this with a grain of salt.
The key advantage of LAMB/LARS to me here is that they work at the tensor level, while Adam's adaptiveness works at the level of single scalar entries. LARS might be a more natural change from Adam than LAMB.
If you are concerned about batch size for the gradient step, you could accumulate a few batches before taking an optimizer step.
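Gradient accumulation is only a few extra lines in a standard training loop; a minimal sketch (placeholder model and random micro-batches of 3, matching the Adam batch size mentioned earlier):

```python
import torch

model = torch.nn.Linear(10, 4)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
accum_steps = 4                 # effective batch = accum_steps * micro-batch

optimizer.zero_grad()
for step in range(8):           # stand-in for iterating over a data loader
    x = torch.randn(3, 10)      # micro-batch of 3
    y = torch.randint(0, 4, (3,))
    loss = loss_fn(model(x), y) / accum_steps  # average over accumulation
    loss.backward()             # grads add up across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()        # one optimizer step per accum_steps batches
        optimizer.zero_grad()
```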
Since my last message, I have mainly worked with the Adam optimizer in float32 precision, and I also started doing batch accumulation (which gives smoother training, and also slightly better accuracy on average). I did not get the NaN problem in f32, but I sometimes observed a loss spike 100x higher than usual after hours of training, making the model lose its progress.

It's possible that those loss spikes in f32 training share the same root cause as the NaN problem I got with mixed-precision training. I think it may still come from v_hat going close to 0 sometimes, and I realized that this kind of problem should be tackled by the amsgrad variant of Adam, since with amsgrad v_hat is taken as the running maximum of past values rather than a moving average. I launched an experiment with amsgrad 48h ago, and the training has gone well so far. I'll try it later with mixed precision to check whether it solves the original problem, and I still plan to play with LARS later as well.
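For anyone following along, switching to the amsgrad variant is a one-flag change in PyTorch (placeholder model):

```python
import torch

model = torch.nn.Linear(10, 4)  # placeholder model
# amsgrad=True keeps v_hat as the elementwise running max of past second
# moments instead of a moving average, so the denominator of the update
# can never shrink back toward 0 once it has grown.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
```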
To keep you informed: I implemented the LARS optimizer myself and tested it without any fine-tuning yet. Using a lr of 0.01, clipping the trust ratio to [1e-4, 1], without momentum or weight_decay, it already slightly outperforms the Adam optimizer. I'll continue with LARS for now; thank you for your suggestions. As I ended up using LARS, I marked your last answer as the solution. I also want to highlight that Adam with amsgrad was more stable in my case than without amsgrad.
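A minimal sketch of such a momentum-free, weight-decay-free LARS step with a clipped trust ratio (a hypothetical reconstruction under the settings described above, not the exact implementation from the thread):

```python
import torch

@torch.no_grad()
def lars_step(params, lr=0.01, clip=(1e-4, 1.0)):
    # One LARS update without momentum or weight decay: each tensor gets an
    # SGD step rescaled by a per-tensor "trust ratio" ||w|| / ||grad||,
    # clipped here to [1e-4, 1] as in the runs described above.
    for p in params:
        if p.grad is None:
            continue
        w_norm = p.norm()
        g_norm = p.grad.norm()
        if w_norm > 0 and g_norm > 0:
            trust = float((w_norm / g_norm).clamp(*clip))
        else:
            trust = 1.0  # fall back to plain SGD for zero norms
        p.add_(p.grad, alpha=-(lr * trust))
```

Because the trust ratio is computed per tensor, layers with tiny gradients get their updates scaled up relative to plain SGD, which is the behavior discussed earlier in the thread.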