When a computer represents a floating-point number, some of the bits are allocated to the exponent and some to the significand (precision). The exponent bits give a wider range of representable numbers (e.g. -10^32 to 10^32), while the precision bits give finer steps between adjacent numbers (e.g. 1.1, 1.2, 1.3, … vs. 1.01, 1.02, 1.03, …).
bfloat16 allocates more bits to the exponent and fewer to precision.
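A quick sketch of what that trade-off means for tiny gradient values (assuming NumPy; since NumPy has no native bfloat16 dtype, bfloat16 is simulated here by truncating the low 16 bits of a float32, which preserves its 8 exponent bits):

```python
import numpy as np

def to_bfloat16(x):
    """Round a float32 value toward zero to bfloat16 by truncating the
    low 16 bits. Simple truncation, not round-to-nearest, but enough
    to illustrate the dynamic range."""
    bits = np.array(x, dtype=np.float32).view(np.uint32)
    return float((bits & np.uint32(0xFFFF0000)).view(np.float32))

grad = 1e-8  # a tiny gradient, below fp16's smallest subnormal (~5.96e-8)

fp16_grad = np.float16(grad)   # underflows to 0.0 in fp16
bf16_grad = to_bfloat16(grad)  # survives: bf16 shares fp32's exponent range

# Loss scaling rescues the value in fp16: multiply the loss (and hence
# the gradients) by a scale factor, then divide it back out after backward.
scale = 1024.0
scaled_fp16_grad = np.float16(grad * scale)

print(fp16_grad)         # 0.0
print(bf16_grad != 0.0)  # True
print(scaled_fp16_grad != 0.0)  # True
```

The same value that vanishes in fp16 stays representable in bf16 because bf16 keeps fp32's 8 exponent bits, at the cost of a much coarser significand.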
Thanks, but I still do not understand why bf16 does not need loss scaling. In fp16, we need loss scaling to keep small gradient values from underflowing to zero, so we typically amplify the loss to preserve those small gradients.