Training loss behaves strangely in mixed-precision training

I have a loss function of the form loss = loss1 + alpha * loss2, where loss1 and loss2 are two loss terms and alpha is a scalar weight. I have the following observations:

(1) If I set loss = loss1, the model trains fine.

(2) If I set loss = loss1 + alpha * loss2 with alpha = 0, training is interrupted by a CUDA error after a couple of epochs (usually between 10 and 20). Sometimes the loss climbs to a high value before the error.

I have done several runs to confirm the above observations. This seems strange to me, as the loss in (2) should be equivalent to the one in (1). Could mixed-precision training cause something like this? Thanks!

I don’t see how mixed-precision training is involved in your use case, so could you post a minimal and executable code snippet reproducing the issue, please?

Thank you for your reply. The code is part of a complex project, and the error happens when training on a large dataset. I am not sure the error could be reproduced without this data, even if I extracted the relevant code snippet (I can confirm the data itself is not the source of the problem, though). Briefly, it is actually the code I asked about here. Sorry for the duplicate, but I include the code below for convenience:

# forward passes under autocast; the teacher runs without gradients
with torch.cuda.amp.autocast(enabled=True):
    outputs_student = model_student(inputs, targets)
    with torch.no_grad():
        outputs_teacher = model_teacher(inputs, targets)
    loss_distillation = distill_loss(outputs_student, outputs_teacher)
    loss_student = some_loss(outputs_student)
    loss = loss_student + weight * loss_distillation

# scaled backward pass and optimizer step via the GradScaler
optimizer.zero_grad()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

The error message is something like

/opt/conda/conda-bld/pytorch_1639180549130/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [87,0,0], thread: [59,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1639180549130/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [87,0,0], thread: [60,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1639180549130/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [87,0,0], thread: [61,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1639180549130/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [87,0,0], thread: [62,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1639180549130/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [87,0,0], thread: [63,0,0] Assertion `input_val >= zero && input_val <= one` failed.

Thanks!

I don’t know which criterion you are using, but most likely the model output and/or the target tensor contains out-of-bounds values. Check the min/max values of all tensors before passing them to the loss function and make sure their ranges are valid.
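Something along these lines should work as a minimal sketch of the check (I’m assuming outputs_student and targets are plain tensors here; adapt the calls if your models return dicts or tuples):

import torch

def check_range(name, tensor, low=None, high=None):
    # detach and upcast so the printout is not affected by autocast/float16
    t = tensor.detach().float()
    print(f"{name}: min={t.min().item():.6f}, max={t.max().item():.6f}, "
          f"all finite={torch.isfinite(t).all().item()}")
    if low is not None and t.min().item() < low:
        print(f"WARNING: {name} has values below {low}")
    if high is not None and t.max().item() > high:
        print(f"WARNING: {name} has values above {high}")

# call right before the criteria; BCE-style targets are expected to be in [0, 1]
check_range("outputs_student", outputs_student)
check_range("targets", targets, low=0.0, high=1.0)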

Thanks! I tried to save the relevant variables when the crash happened using a try/except block, but it seems the CUDA error caused the process to terminate before all data were dumped to disk (I got EOFError: Ran out of input when reading the dumped pkl file back in). Is there any way to avoid this?

Btw: distill_loss is a sum of three torch.nn.functional.mse_loss terms, and loss_student is a combination of torch.nn.BCEWithLogitsLoss and a self-defined IoU loss.

The assert could corrupt the CUDA context, so try to print the values beforehand or add an assert statement yourself checking that the values are in [0, 1].
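Here is a minimal sketch of such a check, assuming probs stands for whatever tensor eventually reaches the binary_cross_entropy-style kernel raising the assert (the name and the bad_batch.pt file are just placeholders):

import torch

# host-side check right before the criterion: it fails with a normal Python
# exception (catchable by try/except) instead of a device-side assert
probs_cpu = probs.detach().float().cpu()
targets_cpu = targets.detach().float().cpu()
if (not torch.isfinite(probs_cpu).all()
        or probs_cpu.min() < 0 or probs_cpu.max() > 1):
    # save the offending batch while the CUDA context is still healthy
    torch.save({"probs": probs_cpu, "targets": targets_cpu}, "bad_batch.pt")
    raise RuntimeError("values outside [0, 1] detected before the loss")

Since the check runs on CPU copies, it should also let you dump the offending tensors before the process dies, which might help with the pkl issue you mentioned.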