Why keep parameters in float32 rather than in (b)float16?

I wonder whether I should keep my model parameters in float16 or bfloat16.

This is probably orthogonal to automatic mixed precision / autocast, or maybe mixed precision would not make sense anymore then?

But leaving that aside, why would you not do this? Is there any downside? Wouldn’t you save even more memory?

In all the documents on mixed precision training, I never really see this part addressed.

I also found this somewhat inconclusive post: Performance (Training Speed) of Autocast Bfloat16

Ah, I was just checking the original paper introducing automatic mixed precision training, and it explains it (Sec 3.1):

In mixed precision training, weights, activations and gradients are stored as FP16. In order to match the accuracy of the FP32 networks, an FP32 master copy of weights is maintained and updated with the weight gradient during the optimizer step. In each iteration an FP16 copy of the master weights is used in the forward and backward pass. …

While the need for FP32 master weights is not universal, there are two possible reasons why a number of networks require it. One explanation is that updates (weight gradients multiplied by the learning rate) become too small to be represented in FP16 - any value whose magnitude is smaller than 2^(−24) becomes zero in FP16. …

Another explanation is that the ratio of the weight value to the weight update is very large. In this case, even though the weight update is representable in FP16, it could still become zero when addition operation right-shifts it to align the binary point with the weight. …
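
To see both effects in isolation, here is a tiny PyTorch check I put together (my own illustration, not from the paper):

import torch

# Effect 1: an update below ~2^(-24) (~6e-8) underflows to zero in float16
update = torch.tensor(1e-8)           # float32 by default
print(update.to(torch.float16))       # tensor(0., dtype=torch.float16) -> the update vanishes
print(update)                         # tensor(1.0000e-08) -> float32 still holds it

# Effect 2: the update is representable in float16 on its own, but is lost
# when added to a much larger weight (float16 spacing around 1.0 is ~9.8e-4)
w16 = torch.tensor(1.0, dtype=torch.float16)
upd16 = torch.tensor(1e-4, dtype=torch.float16)
print(w16 + upd16)                    # tensor(1., dtype=torch.float16) -> update swallowed
print(torch.tensor(1.0) + 1e-4)       # tensor(1.0001) -> an FP32 master weight keeps it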

This thread explains the difference:

bfloat16 is more ML-friendly. It gives a much wider range of values, at the cost of less precision in each stepwise difference.

Take any large model and print out your parameters:

for param in model.parameters():
    print(param)

If you’re in float32 and the model has in excess of 1 billion parameters, you will likely see that many of the values are very small.

However, the smallest normal positive value float16 can represent is 6.10 × 10^(-5), while bfloat16 can go down to roughly 10^(-38). Hence float16 may require additional loss scaling.
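
You can check these limits directly with torch.finfo (a quick sketch, assuming a recent PyTorch version):

import torch

# compare smallest normal value, machine epsilon and max for each dtype
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    fi = torch.finfo(dtype)
    print(dtype, "smallest normal:", fi.tiny, "eps:", fi.eps, "max:", fi.max)

float16 reports a smallest normal of about 6.1e-05, eps of about 9.8e-04 and a max of 65504, while bfloat16 reports about 1.2e-38, 7.8e-03 and 3.4e+38: the wider-range / coarser-precision trade-off described above.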

Additionally, because bfloat16 caps precision, it may act as an additional form of regularization that prevents some overfitting, which can make bfloat16 models better able to generalize.

Sorry, I realize my question could be misinterpreted.

I did not ask about float16 vs bfloat16.

I asked about float32 vs (b)float16. In AMP training, you still keep the parameters in float32, and they are then automatically cast. My question is: why is this casting needed? Why not store the params directly in (b)float16?
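
For concreteness, this is the situation I mean (a small sketch, using CPU bfloat16 autocast as an example): the parameters stay float32 and only the compute inside the autocast region runs in lower precision:

import torch

model = torch.nn.Linear(16, 4)            # parameters are created in float32
x = torch.randn(8, 16)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)                        # the matmul runs in bfloat16 under autocast
print(next(model.parameters()).dtype)     # torch.float32
print(out.dtype)                          # torch.bfloat16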

Your posted reference already explains it: master parameter/gradient copies would be needed. This mechanism was used in the legacy apex.amp as its O2-style mixed-precision training.
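
To make that concrete, here is a minimal sketch of the master-weight pattern (just an illustration, not the actual apex.amp or torch.amp implementation): the forward/backward uses a low-precision copy, while the update is applied to the float32 master parameter:

import torch

master_w = torch.randn(4)                                  # FP32 master copy of a parameter
lr = 1e-3

for step in range(3):
    w_low = master_w.to(torch.bfloat16).requires_grad_()   # low-precision working copy
    loss = (w_low * w_low).sum()                           # forward/backward in bfloat16
    loss.backward()
    with torch.no_grad():
        master_w -= lr * w_low.grad.float()                # update is applied in FP32

torch.amp / torch.cuda.amp achieve a similar effect without these explicit copies by keeping the parameters in float32 and casting on the fly inside autocast regions.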