How to continue training with DDP when the loss is NaN caused by AMP

I tried to zero out the loss tensor and continue, but it didn't work; it raises "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn":

    with torch.cuda.amp.autocast():
        loss, _, _ = model(samples, mask_ratio=args.mask_ratio)

    loss_value = loss.item()
    
    if not math.isfinite(loss_value):
        print(f"Loss is {loss_value} in iter {data_iter_step}, continue training")
        print(loss.shape)
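        # NOTE: torch.zeros_like returns a tensor that is detached from the graph
        # and has requires_grad=False, so the later backward() inside loss_scaler
        # raises the RuntimeError quoted above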
        loss = torch.zeros_like(loss)
        loss_value = 0
    
    loss /= accum_iter
    loss_scaler(loss, optimizer, parameters=model.parameters(),
                update_grad=(data_iter_step + 1) % accum_iter == 0)
    if (data_iter_step + 1) % accum_iter == 0:
        optimizer.zero_grad()

    torch.cuda.synchronize()

    metric_logger.update(loss=loss_value)

    lr = optimizer.param_groups[0]["lr"]
    metric_logger.update(lr=lr)

    loss_value_reduce = misc.all_reduce_mean(loss_value)
    if log_writer is not None and (data_iter_step + 1) % accum_iter == 0:
        """ We use epoch_1000x as the x-axis in tensorboard.
        This calibrates different curves when batch size changes.
        """
        epoch_1000x = int((data_iter_step / len(data_loader) + epoch) * 1000)
        log_writer.add_scalar('train_loss', loss_value_reduce, epoch_1000x)
        log_writer.add_scalar('lr', lr, epoch_1000x)

You should not expect to see NaN values in the loss, as they usually indicate that your model is overflowing internally.
If you still call backward on such a loss, the gradients will also be NaN and the GradScaler will skip the parameter update. Since nothing in the model changed, the next forward pass will likely produce a NaN loss again, so you should debug which operation creates the invalid value.
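To locate the overflowing operation, one option is to register forward hooks that flag the first module producing a non-finite output. The sketch below is only a debugging aid, not part of the training loop, and it assumes `model`, `samples`, and `args` are the objects from the snippet above:

    import torch

    def make_finite_check(name):
        # forward hook: report any module whose output contains inf/NaN
        def hook(module, inputs, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outputs:
                if torch.is_tensor(out) and not torch.isfinite(out).all():
                    print(f"non-finite output in {name} ({module.__class__.__name__})")
        return hook

    # register a hook on every submodule, run the failing iteration, then clean up
    handles = [m.register_forward_hook(make_finite_check(n))
               for n, m in model.named_modules()]
    with torch.cuda.amp.autocast():
        loss, _, _ = model(samples, mask_ratio=args.mask_ratio)
    for h in handles:
        h.remove()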

Skipping iterations in DDP might not be possible without a no_sync() context manager, as DDP expects to allreduce the gradient buckets.
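If you still want to skip such iterations, one pattern (a sketch, not a drop-in fix) is to make the decision collectively, so that every rank either runs or skips the backward pass together and no rank is left waiting on an allreduce. This assumes torch.distributed is initialized and reuses the `loss`, `loss_value`, `optimizer`, and `loss_scaler` names from the question:

    import math

    import torch
    import torch.distributed as dist

    # agree across ranks whether any of them saw a non-finite loss, so that
    # either all ranks call backward (and allreduce the buckets) or none do
    found_nonfinite = torch.tensor(
        [0.0 if math.isfinite(loss_value) else 1.0], device=loss.device)
    dist.all_reduce(found_nonfinite, op=dist.ReduceOp.MAX)

    if found_nonfinite.item() > 0:
        # every rank skips this step together; nothing is allreduced
        optimizer.zero_grad()
    else:
        loss = loss / accum_iter
        loss_scaler(loss, optimizer, parameters=model.parameters(),
                    update_grad=(data_iter_step + 1) % accum_iter == 0)
        if (data_iter_step + 1) % accum_iter == 0:
            optimizer.zero_grad()

Even with such a skip in place, the underlying overflow is still there, so tracking down the operation that produces the NaN remains the better fix.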