I am working on a multi-class classification task (using cross-entropy loss) and I am running into an issue when using the Adam optimizer and mixed precision together.
After a few hours of training, the loss starts to go to NaN. I saw here that some people have faced the same issue and advised increasing the
eps term of Adam so that it is not rounded to 0 in float16, setting it to 1e-4 when working in half precision. (Some people used a lower eps=1e-7 there, but others argued it may not solve the problem.)
So I tried this suggestion, but with
eps=1e-4 my loss hardly decreases over time, whereas it was decreasing quite fast with the default
eps=1e-8 (until it went to NaN). The same behaviour seems to have been observed by other people, as in this post.
Is it possible to avoid these NaNs without increasing the
eps term of Adam too much?
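
For reference, here is a minimal sketch of the kind of training loop I mean (not my exact code; the model, data, and hyperparameters are placeholders):

```python
# Minimal sketch of the mixed-precision setup described above.
# Model, data, and hyperparameters are placeholders, not my real code.
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(128, 10).to(device)        # stand-in for my classifier
criterion = nn.CrossEntropyLoss()
# Default eps=1e-8 eventually leads to NaN; eps=1e-4 makes the loss stall.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-8)
scaler = torch.cuda.amp.GradScaler()

for step in range(1000):
    # synthetic batch standing in for my real data
    inputs = torch.randn(64, 128, device=device)
    targets = torch.randint(0, 10, (64,), device=device)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # forward pass runs in float16
        logits = model(inputs)
        loss = criterion(logits, targets)

    scaler.scale(loss).backward()            # scale the loss to avoid gradient underflow
    scaler.step(optimizer)                   # unscales grads; skips the step on inf/NaN grads
    scaler.update()
```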