When training with AMP and GradScaler using the usual recipe, i.e.:
scaler.scale(loss).backward()  # backward pass on the scaled loss
scaler.step(optimizer)         # unscales gradients, then calls optimizer.step() if no infs/NaNs
scaler.update()                # adjusts the scale factor for the next iteration
Is the optimizer step performed in full (float32) or half (float16) precision? And could this lead to issues with very, very small learning rates (e.g., the update lr * grad underflowing or vanishing in half precision)?
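
For context, here is a minimal, self-contained version of what I am running, with dtype checks added (the toy model, data, and learning rate are placeholders for illustration only, assuming a CUDA device):

import torch

device = "cuda"
model = torch.nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-8)  # deliberately tiny lr
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 16, device=device)
y = torch.randn(8, 1, device=device)

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)                                  # linear layer runs in float16 under autocast
    loss = torch.nn.functional.mse_loss(out, y)
print(out.dtype)                                    # torch.float16

scaler.scale(loss).backward()
print(model.weight.dtype, model.weight.grad.dtype)  # both torch.float32

scaler.step(optimizer)                              # unscales grads, then optimizer.step()
scaler.update()

So the parameters and their gradients appear to stay float32 outside the autocast region; my question is what precision the step itself uses and whether a tiny lr can still be lost.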