When training with AMP and GradScaler using the usual recipe, i.e.:
scaler.scale(loss).backward()  # backward pass on the scaled loss
scaler.step(optimizer)         # unscales gradients, then calls optimizer.step() if no infs/NaNs
scaler.update()                # adjusts the scale factor for the next iteration
Is the optimizer step performed in full (float32) or half (float16) precision? And could this lead to issues with very, very small learning rates (e.g., the update lr * grad underflowing or vanishing in half precision)?
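
For context, here is a minimal, self-contained version of what I am running, with dtype checks added (the toy model, data, and learning rate are placeholders for illustration only, assuming a CUDA device):

import torch

device = "cuda"
model = torch.nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-8)  # deliberately tiny lr
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 16, device=device)
y = torch.randn(8, 1, device=device)

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)                                  # linear layer runs in float16 under autocast
    loss = torch.nn.functional.mse_loss(out, y)
print(out.dtype)                                    # torch.float16

scaler.scale(loss).backward()
print(model.weight.dtype, model.weight.grad.dtype)  # both torch.float32

scaler.step(optimizer)                              # unscales grads, then optimizer.step()
scaler.update()

So the parameters and their gradients appear to stay float32 outside the autocast region; my question is what precision the step itself uses and whether a tiny lr can still be lost.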