Training loss behaves strangely in mixed-precision training

I have a loss function of the form loss = loss1 + alpha * loss2, where loss1 and loss2 are two loss terms and alpha is a scalar weight. I have the following observations:

(1) If I set loss = loss1, the model trains fine.

(2) If I set loss = loss1 + alpha * loss2 with alpha = 0, training is interrupted by a CUDA error after a couple of epochs (usually between 10 and 20). Sometimes the loss climbs to a high value before the error.

I have done several runs to confirm the above observations. This seems strange to me, as the loss in (2) should be equivalent to the one in (1). Could mixed-precision training cause something like this? Thanks!

I don’t see how mixed-precision training is involved in your use case, so could you post a minimal and executable code snippet reproducing the issue, please?

Thank you for your reply. The code is part of a complex project, and the error happens when training on a large dataset. I am not sure the error could be reproduced without this data, even if I extracted the relevant code snippet (I can confirm the data itself is not the source of the problem, though). Briefly, it is actually the code I asked about here. Sorry for the duplicate, but I include the code below for convenience:

# forward passes under autocast; the teacher runs without gradients
with torch.cuda.amp.autocast(enabled=True):
    outputs_student = model_student(inputs, targets)
    with torch.no_grad():
        outputs_teacher = model_teacher(inputs, targets)
    loss_distillation = distill_loss(outputs_student, outputs_teacher)
    loss_student = some_loss(outputs_student)
    loss = loss_student + weight * loss_distillation

# scaled backward pass and optimizer step via the GradScaler
optimizer.zero_grad()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

The error message is something like

/opt/conda/conda-bld/pytorch_1639180549130/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [87,0,0], thread: [59,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1639180549130/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [87,0,0], thread: [60,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1639180549130/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [87,0,0], thread: [61,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1639180549130/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [87,0,0], thread: [62,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1639180549130/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [87,0,0], thread: [63,0,0] Assertion `input_val >= zero && input_val <= one` failed.

Thanks!

I don’t know which criterion you are using, but most likely the model output and/or the target tensor contains out-of-bounds values. Check the min/max values of all tensors before passing them to the loss function and make sure their ranges are valid.
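Something along these lines should work as a minimal sketch of the check (I’m assuming outputs_student and targets are plain tensors here; adapt the calls if your models return dicts or tuples):

import torch

def check_range(name, tensor, low=None, high=None):
    # detach and upcast so the printout is not affected by autocast/float16
    t = tensor.detach().float()
    print(f"{name}: min={t.min().item():.6f}, max={t.max().item():.6f}, "
          f"all finite={torch.isfinite(t).all().item()}")
    if low is not None and t.min().item() < low:
        print(f"WARNING: {name} has values below {low}")
    if high is not None and t.max().item() > high:
        print(f"WARNING: {name} has values above {high}")

# call right before the criteria; BCE-style targets are expected to be in [0, 1]
check_range("outputs_student", outputs_student)
check_range("targets", targets, low=0.0, high=1.0)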

Thanks! I tried to save the relevant variables when the crash happened using a try/except block, but it seems the CUDA error caused the process to terminate before all data were dumped to disk (I got EOFError: Ran out of input when reading the dumped pkl file back in). Is there any way to avoid this?

Btw: distill_loss is a sum of three torch.nn.functional.mse_loss terms, and loss_student is a combination of torch.nn.BCEWithLogitsLoss and a self-defined IoU loss.

The assert could corrupt the CUDA context, so try to print the values beforehand or add an assert statement yourself checking that the values are in [0, 1].
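Here is a minimal sketch of such a check, assuming probs stands for whatever tensor eventually reaches the binary_cross_entropy-style kernel raising the assert (the name and the bad_batch.pt file are just placeholders):

import torch

# host-side check right before the criterion: it fails with a normal Python
# exception (catchable by try/except) instead of a device-side assert
probs_cpu = probs.detach().float().cpu()
targets_cpu = targets.detach().float().cpu()
if (not torch.isfinite(probs_cpu).all()
        or probs_cpu.min() < 0 or probs_cpu.max() > 1):
    # save the offending batch while the CUDA context is still healthy
    torch.save({"probs": probs_cpu, "targets": targets_cpu}, "bad_batch.pt")
    raise RuntimeError("values outside [0, 1] detected before the loss")

Since the check runs on CPU copies, it should also let you dump the offending tensors before the process dies, which might help with the pkl issue you mentioned.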