Half-precision training adaptations

Hello,
I wanted to try a few training recipes in a fixed (half) precision.

  1. I'm looking for documentation on how the different modules in a model should be converted correctly.
    My current setup just uses model.half() or model.to(torch.float16). It has not thrown any blocking errors, but I would like to set up my code correctly for future use.
  2. loss scaling
    1. The current LRs for the different high-level modules are set to a known, stable config that works in single precision and in mixed precision with float16/bfloat16.
      1. [edit] Of course I don't expect to keep these as the final LRs in half precision; I'm only using them as a starting point.
    2. When running in half precision, at step 2 the feature outputs from the early layers jump to NaN.
    3. I'm looking to use a scaler for the loss function at this stage, but amp.GradScaler raises:
    4. traceback:
        File ".../trainers.py", line 92, in run_step
          self.scaler.unscale_(self.optimizer)
        File ".../python3.12/site-packages/torch/amp/grad_scaler.py", line 342, in unscale_
          optimizer_state["found_inf_per_device"] = self._unscale_grads_(
                                                    ^^^^^^^^^^^^^^^^^^^^^
        File ".../python3.12/site-packages/torch/amp/grad_scaler.py", line 264, in _unscale_grads_
          raise ValueError("Attempting to unscale FP16 gradients.")
        ValueError: Attempting to unscale FP16 gradients.

    5. scaling code:
                  self.scaler.scale(loss).backward()
                  self.scaler.unscale_(self.optimizer)
                  if self.cfg.clip_grad is not None:
                      torch.nn.utils.clip_grad_norm_(
                          self.model.parameters(), self.cfg.clip_grad
                      )
                  self.scaler.step(self.optimizer)
      
                  # With AMP enabled, optimizer.step() is skipped when the scaled grads contain inf/NaN,
                  # and update() then lowers the scale. Only step the scheduler when the scale did not
                  # drop, which avoids the torch warning about scheduler.step() before optimizer.step().
                  scale_before = self.scaler.get_scale()
                  self.scaler.update()
                  if scale_before <= self.scaler.get_scale():
                      self.scheduler.step()
      
    6. What would be the correct way to implement loss scaling in such a configuration with any PyTorch-based modules?

Did you check the mixed precision training examples which demonstrate how to use autocast instead of explicitly transforming the parameters to a lower dtype which can cause numerical stability issues?
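
To make the numerical stability point concrete, here is a small standard-library sketch (not taken from the linked examples) that rounds Python floats to float16 using the struct module's "e" format. The helper name to_fp16 and the scale of 1024 are illustrative assumptions only. It shows a gradient-sized value flushing to zero, the same value surviving once it is scaled, and an activation-sized value overflowing:

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value.

    struct raises OverflowError for finite values too large for float16,
    so those are mapped to +/-inf here to mimic what a float16 tensor would hold.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


small_grad = 1e-8      # below half of the smallest float16 subnormal (2**-24)
scale = 1024.0         # illustrative loss scale, not a recommendation

print(to_fp16(small_grad))            # 0.0: the value underflows
scaled = to_fp16(small_grad * scale)  # non-zero: the scaled value is representable
print(scaled, scaled / scale)         # dividing in higher precision recovers roughly 1e-8
print(to_fp16(70000.0))               # inf: above the float16 maximum of 65504
```

Loss scaling addresses the first case (underflow of small grads); the last case (overflow in activations) is not helped by scaling the loss and is one likely source of NaNs coming out of early layers.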

hi @ptrblck

thanks for reverting.

Of course, I've already checked AMP-based training, and as mentioned in 2.1, I've already got the training code working with autocast with the desired model outcomes.

I now intend to train a model in half precision only.

I'm getting an error using the GradScaler and am likely not using it correctly.

For a minimal example, I'm adapting the example code from the "Automatic Mixed Precision examples" page of the PyTorch 2.7 documentation:

import torch
from torch.amp import GradScaler

# model, optimizer, loss_fn, epochs, data and max_norm are defined elsewhere
scaler = GradScaler()
model.half()  # same as model.to(torch.float16)
for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        output_logits = model(input)
        # output_logits.dtype is torch.float16
        # target.dtype is torch.int64
        loss = loss_fn(output_logits.float(), target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        # ERROR raised here: ValueError: Attempting to unscale FP16 gradients.
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

scaler.unscale_(optimizer) gives the ValueError: Attempting to unscale FP16 gradients.

In hindsight, I could have phrased my initial question as just point 2.3 rather than the full set. Apologies for any confusion.

Hi,

GradScaler is designed to work with mixed precision training.

scaler.unscale_(optimizer) expects the model params and grads to be in float32 so it can unscale the grads after backpropagating a scaled loss.

You're getting the error because your model weights (and therefore their grads) are in float16, which GradScaler doesn't support.

When working fully in half precision, you'll have to handle the loss scaling / unscaling manually (and carefully) to avoid overflows and underflows.

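(Since the whole model is in float16, there is no float32 copy of the grads to unscale into, so any unscaling happens directly on the float16 grads. If you do scale the loss, keep in mind that the grads come out scaled by the same factor.)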