Adam+Half Precision = NaNs?

EDIT: see below, it looks like eps was the culprit after all, no need for this solution.

Alright, got this working by just hanging onto fp32 copies of the parameters and keeping all of the Adam values in fp32 as well, as shown in This Gist. I suspect you could get the desired stability and do this even more efficiently by just keeping the Adam values in fp32 (I think there’s a divide-by-0 happening somewhere) but this gives the desired memory reduction without any loss of speed over fp32.

1 Like