Half Precision based training adaptations

nath-11743 · August 2, 2025, 10:57am

Hello,
I wanted to try a few training recipes in fixed precision.

1. I'm looking for documentation on how the different modules in a model should be converted correctly. My current setup just uses model.half() or model.to(torch.float16). It has not thrown any blocking errors, but I would like to set up my code correctly for future use.

2. Loss scaling

2.1. The current LRs for the different high-level modules are set to a known, stabilised config that works in single precision and in mixed precision with float16/bfloat16.
     [edit] Of course I don't expect to keep these as the final LRs in half precision; I'm only using them as a starting point.
2.2. When running in half precision, at step 2 the feature outputs jump to NaN, starting from the early layers.
2.3. I'm looking to use a scaler for the loss function at this stage, but amp.GradScaler raises:

  File ".../trainers.py", line 92, in run_step
    self.scaler.unscale_(self.optimizer)
  File ".../python3.12/site-packages/torch/amp/grad_scaler.py", line 342, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
                                              ^^^^^^^^^^^^^^^^^^^^^
  File ".../python3.12/site-packages/torch/amp/grad_scaler.py", line 264, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

#scaling code
            self.scaler.scale(loss).backward()
            self.scaler.unscale_(self.optimizer)
            if self.cfg.clip_grad is not None:
                torch.nn.utils.clip_grad_norm_(
                    self.model.parameters(), self.cfg.clip_grad
                )
            self.scaler.step(self.optimizer)

            # With AMP enabled, the optimizer.step() call is skipped if the loss scaling factor is too large.
            # Only step the scheduler when the scale did not shrink, to avoid the torch warning about calling scheduler.step() before optimizer.step().
            scaler = self.scaler.get_scale()
            self.scaler.update()
            if scaler <= self.scaler.get_scale():
                self.scheduler.step()

What would be the correct way to implement loss scaling in such a configuration with any PyTorch-based modules?

ptrblck · August 2, 2025, 4:47pm

Did you check the mixed precision training examples, which demonstrate how to use autocast instead of explicitly casting the parameters to a lower dtype? Casting the parameters directly can cause numerical stability issues.

nath-11743 · August 3, 2025, 12:56pm

Hi @ptrblck,

Thanks for getting back to me.

Of course, I've already checked AMP-based training and, as mentioned in 2.1, I already have the training code working with autocast with the desired model outcomes.

I now intend to train a model in half precision only.

I'm getting an error using the GradScaler and am likely not using it correctly.

For a minimal example, I'm adapting the example code from the Automatic Mixed Precision examples page of the PyTorch 2.7 documentation:

import torch
from torch.amp import GradScaler

scaler = GradScaler()
model.half()  # equivalent to model.to(torch.float16)
for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        output_logits = model(input)
        # output_logits.dtype is torch.float16; target is torch.int64
        loss = loss_fn(output_logits.float(), target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        # ERROR: ValueError: Attempting to unscale FP16 gradients.
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

scaler.unscale_(optimizer) gives the ValueError: Attempting to unscale FP16 gradients.

In hindsight, I could have phrased my initial question as just point 2.3 rather than the full set. Apologies for any confusion.

Dhia-naouali · August 4, 2025, 2:54am

Hi,

GradScaler is designed to work with mixed precision training.

scaler.unscale_(optimizer) expects the model's params and grads to be in float32, so that it can unscale the grads after backpropagating a scaled loss.
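A small numerical illustration of why the precision of the grads matters when scaling and unscaling. This is only a sketch using the Python standard library: struct's "e" format is IEEE 754 half precision, the helper name to_fp16 and the example values are made up for illustration, and a plain Python float stands in for the higher-precision (float32) copy.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


true_grad = 1e-8      # hypothetical small gradient value
loss_scale = 1024.0   # hypothetical loss scale

# Stored directly in float16, the gradient underflows to zero because it is
# below half of float16's smallest subnormal (2**-24).
unscaled_fp16 = to_fp16(true_grad)

# Scaling the loss scales the gradient too, which keeps it representable.
scaled_fp16 = to_fp16(true_grad * loss_scale)

# Unscaling in higher precision recovers a value close to the true gradient.
recovered_high_precision = scaled_fp16 / loss_scale

# Unscaling and storing the result back in float16 loses it again.
recovered_fp16 = to_fp16(scaled_fp16 / loss_scale)

print(unscaled_fp16, scaled_fp16, recovered_high_precision, recovered_fp16)
```

The scaled value survives in float16, but it is only useful if the division by the scale happens in a wider dtype, which is one reason mixed precision keeps float32 params and grads.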

You're hitting the error because your model weights (and therefore their grads) are in float16, which GradScaler does not support.

When working fully in half precision, you'll have to handle loss scaling / unscaling manually (and carefully) to avoid overflows and underflows.

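(If you don't scale the loss, there is nothing to unscale. If you do scale it, the float16 grads come out scaled by the same factor, so they still need to be divided by the scale before clipping or the optimizer step.)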