Why does bf16 not need loss scaling?

I read in this post that when using fp16 mixed precision, we need loss scaling to preserve small gradient magnitudes.

However, bf16 has fewer fraction bits than fp16, so I would expect bf16 to be even worse at preserving small gradient values. So it seems that loss scaling should also be needed for bf16.

Can you help me figure this out?

When a computer represents a floating-point number, some of the bits are allocated to the exponent and some to the precision (mantissa). The exponent bits give a wider range of representable numbers (e.g. -10^32 to 10^32), while the precision bits give finer steps between neighboring numbers (e.g. 1.1, 1.2, 1.3, … vs. 1.01, 1.02, 1.03, …).

bfloat16 allocates more bits to the exponent and fewer to the precision (mantissa) than float16 does.
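
To make this concrete, here is a quick way to compare the formats (a minimal sketch assuming PyTorch is installed; `torch.finfo` reports each dtype's range and precision):

```python
import torch

# float16:  1 sign bit, 5 exponent bits, 10 mantissa bits
# bfloat16: 1 sign bit, 8 exponent bits,  7 mantissa bits (same exponent width as float32)
# float32:  1 sign bit, 8 exponent bits, 23 mantissa bits
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    # max/tiny come from the exponent bits (range); eps comes from the mantissa (precision)
    print(f"{str(dtype):15s} max={info.max:.2e}  tiny={info.tiny:.2e}  eps={info.eps:.2e}")
```

The output shows bfloat16 covering roughly the same range as float32 (max ≈ 3.4e38, tiny ≈ 1.2e-38) but with a coarser eps, while float16 has a finer eps but a much narrower range (max ≈ 6.6e4, tiny ≈ 6.1e-5).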

Thanks, but I still do not understand why bf16 does not need loss scaling for better precision, since in fp16 we need loss scaling to avoid small gradient values becoming zero. In fp16, we typically have to amplify the loss to preserve small gradient values.
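
For context, this is roughly the fp16 loss-scaling pattern I mean (a minimal sketch using PyTorch's AMP utilities; it assumes a CUDA device, and the model, optimizer, and batch are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1).cuda()                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                   # maintains a dynamic loss scale

x, y = torch.randn(8, 10).cuda(), torch.randn(8, 1).cuda()  # placeholder batch

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()   # backprop on (loss * scale) so small grads stay nonzero in fp16
scaler.step(optimizer)          # unscales the grads, skips the step if any are inf/nan
scaler.update()                 # adjusts the scale for the next iteration
```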

bfloat16 can go all the way down to roughly 1e-38 (the same smallest normal value as float32), whereas float16's smallest normal value is about 6e-5 (about 6e-8 with subnormals). Does it make sense why that might be beneficial when many gradient values during training can fall below float16's threshold?
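
You can see the underflow directly by casting a small gradient-sized value into each format (a minimal sketch, again assuming PyTorch; the value 1e-10 just stands in for a small gradient):

```python
import torch

g = torch.tensor(1e-10)          # well below float16's smallest subnormal (~6e-8)

print(g.to(torch.float16))       # prints 0. -> the value underflows and is lost
print(g.to(torch.bfloat16))      # prints roughly 1e-10 -> the value survives

# Loss scaling is the fp16 workaround: scale up before backward, unscale after.
scale = 2.0 ** 24
scaled = (g * scale).to(torch.float16)   # ~1.68e-3, comfortably inside fp16's range
print(scaled.float() / scale)            # back to roughly 1e-10 after unscaling
```

With bf16 the extra exponent range makes that scaling dance unnecessary, which is why bf16 mixed precision is usually run without a loss scaler.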
