However, bf16 has fewer fraction bits than fp16, so I think using bf16 will not be able to preserve small gradient values. So it seems that loss scaling should also be needed in bf16.
When a computer represents a number, some of the bits are allocated to the exponent and some to the fraction (precision). The exponent bits give a wider range of numbers (e.g. -10^32 to 10^32), while the fraction bits give finer steps between neighboring numbers (e.g. 1.1, 1.2, 1.3, … vs. 1.01, 1.02, 1.03, …).
bfloat16 allocates more bits to the exponent and fewer to the fraction.
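If it helps to see it concretely, here is a quick sketch (assuming PyTorch is available) that prints what each format's exponent/fraction split buys you in range vs. step size:

```python
# Sketch: compare the numeric limits that fall out of each format's
# exponent/fraction split (requires PyTorch).
import torch

for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: smallest normal = {info.tiny:.3e}, "
          f"max = {info.max:.3e}, step size near 1.0 (eps) = {info.eps:.3e}")

# float16  -> smallest normal ~6.1e-05, max ~6.6e+04, eps ~9.8e-04 (fine steps, narrow range)
# bfloat16 -> smallest normal ~1.2e-38, max ~3.4e+38, eps ~7.8e-03 (coarse steps, fp32-like range)
```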
Thanks, but I still do not understand why bf16 does not need loss scaling for better precision, since in fp16 we need loss scaling to avoid small gradient values becoming zero, i.e. we typically amplify the loss to preserve the small gradients.
bfloat16 can go all the way down to ~1.2e-38 (~9e-41 with subnormals), whereas float16 bottoms out around ~6e-5 (~6e-8 with subnormals). Does it make sense why that might be beneficial when many gradient values can often be below that threshold?
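To make that concrete, here is a minimal sketch (PyTorch assumed; the 1e-8 gradient value and the 65536 scale factor are just illustrative) showing a small gradient underflowing in fp16 but not in bf16, and how scaling before the cast rescues fp16:

```python
# Sketch: a small gradient-like value underflows to zero in fp16 but
# survives in bf16; scaling before the cast (loss scaling) avoids the underflow.
import torch

g = torch.tensor(1e-8)                      # illustrative "small gradient" in fp32

print(g.to(torch.float16))                  # tensor(0., dtype=torch.float16) -> underflowed
print(g.to(torch.bfloat16))                 # tensor(1.0012e-08, dtype=torch.bfloat16) -> preserved

scale = 65536.0                             # illustrative loss-scale factor
scaled = (g * scale).to(torch.float16)      # scale *before* casting down to fp16
print(scaled.float() / scale)               # ~1e-08 recovered after unscaling in fp32
```

Loss scaling is essentially that multiply-before-cast (and divide-after) done automatically, which is why it matters for fp16 but is usually unnecessary for bf16.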