However, bf16 has fewer fraction bits than fp16, so I think using bf16 will not be able to preserve small gradient values. So it seems that loss scaling should also be needed in bf16.
When a computer represents a number, some of the bits are allocated to the exponent and some to the fraction (precision). The exponent bits give a wider range of numbers (e.g. -10^32 to 10^32), while the fraction bits give finer steps between neighboring numbers (e.g. 1.1, 1.2, 1.3, … vs. 1.01, 1.02, 1.03, …).
bfloat16 allocates more bits to the exponent and fewer to the fraction.
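If it helps to see it concretely, here is a quick sketch (assuming PyTorch is available) that prints what each format's exponent/fraction split buys you in range vs. step size:

```python
# Sketch: compare the numeric limits that fall out of each format's
# exponent/fraction split (requires PyTorch).
import torch

for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: smallest normal = {info.tiny:.3e}, "
          f"max = {info.max:.3e}, step size near 1.0 (eps) = {info.eps:.3e}")

# float16  -> smallest normal ~6.1e-05, max ~6.6e+04, eps ~9.8e-04 (fine steps, narrow range)
# bfloat16 -> smallest normal ~1.2e-38, max ~3.4e+38, eps ~7.8e-03 (coarse steps, fp32-like range)
```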
Thanks, but I still do not understand why bf16 does not need loss scaling for better precision, since in fp16 we need loss scaling to avoid small gradient values becoming zero, i.e. we typically amplify the loss to preserve the small gradients.
bfloat16 can go all the way down to ~1.2e-38 (~9e-41 with subnormals), whereas float16 bottoms out around ~6e-5 (~6e-8 with subnormals). Does it make sense why that might be beneficial when many gradient values can often be below that threshold?
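To make that concrete, here is a minimal sketch (PyTorch assumed; the 1e-8 gradient value and the 65536 scale factor are just illustrative) showing a small gradient underflowing in fp16 but not in bf16, and how scaling before the cast rescues fp16:

```python
# Sketch: a small gradient-like value underflows to zero in fp16 but
# survives in bf16; scaling before the cast (loss scaling) avoids the underflow.
import torch

g = torch.tensor(1e-8)                      # illustrative "small gradient" in fp32

print(g.to(torch.float16))                  # tensor(0., dtype=torch.float16) -> underflowed
print(g.to(torch.bfloat16))                 # tensor(1.0012e-08, dtype=torch.bfloat16) -> preserved

scale = 65536.0                             # illustrative loss-scale factor
scaled = (g * scale).to(torch.float16)      # scale *before* casting down to fp16
print(scaled.float() / scale)               # ~1e-08 recovered after unscaling in fp32
```

Loss scaling is essentially that multiply-before-cast (and divide-after) done automatically, which is why it matters for fp16 but is usually unnecessary for bf16.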