Assertion `input_val >= zero && input_val <= one` failed

Hi all,
Recently I changed the CPU and motherboard of my PC, but when I tried to run my training code, I encountered this problem. I haven't changed anything in my scripts, so I'm wondering whether it's caused by the CPU (Ryzen 7 3800XT).
The "Traceback (most recent call last):" shown below changes every time: sometimes it points at a Conv layer and sometimes at BatchNorm, but the assertion error happens every time.
I have tried to fix this by myself for several days, but I cannot find much information about this error. If anyone can help, I would really appreciate it.

/opt/conda/conda-bld/pytorch_1607370193460/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [0,0,0] Assertion input_val >= zero && input_val <= one failed.
/opt/conda/conda-bld/pytorch_1607370193460/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [1,0,0] Assertion input_val >= zero && input_val <= one failed.
/opt/conda/conda-bld/pytorch_1607370193460/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [2,0,0] Assertion input_val >= zero && input_val <= one failed.
/opt/conda/conda-bld/pytorch_1607370193460/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [3,0,0] Assertion input_val >= zero && input_val <= one failed.
/opt/conda/conda-bld/pytorch_1607370193460/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [4,0,0] Assertion input_val >= zero && input_val <= one failed.
/opt/conda/conda-bld/pytorch_1607370193460/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [5,0,0] Assertion input_val >= zero && input_val <= one failed.
/opt/conda/conda-bld/pytorch_1607370193460/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [6,0,0] Assertion input_val >= zero && input_val <= one failed.
/opt/conda/conda-bld/pytorch_1607370193460/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [7,0,0] Assertion input_val >= zero && input_val <= one failed.
Traceback (most recent call last):
File “/home/jhyan/Scripts/PYTHON_PROJECT/FPN/train.py”, line 52, in <module>
main()
File “/home/jhyan/Scripts/PYTHON_PROJECT/FPN/train.py”, line 36, in main
operate.train()
File “/home/jhyan/Scripts/PYTHON_PROJECT/FPN/operate.py”, line 90, in train
self.opt_encoder.zero_grad()
File “/home/jhyan/anaconda3/envs/domian-ada/lib/python3.6/site-packages/torch/optim/optimizer.py”, line 192, in zero_grad
p.grad.zero_()


Based on the stack trace, this check is failing in binary_cross_entropy_out_cuda and points towards an invalid target. Could you check the min. and max. values of your current target tensor and make sure its values are in the range [0, 1]?
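A quick check along those lines (here target stands for whatever tensor you pass as the BCE target) would be:

print(target.min().item(), target.max().item())
assert ((target >= 0) & (target <= 1)).all(), "target values outside [0, 1]"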


Hi ptrblck,
I use BCE to optimize my GAN, and I create the target myself with the following code, so I'm pretty sure it's within the range [0, 1]:

                # for real samples
                if idx % 3 == 0:
                    label = torch.full([batch_size, ], 0.0, device=self.device)
                else:
                    label = torch.full([batch_size, ], 1.0, device=self.device)
                # for fake samples
                if idx % 3 == 0:
                    label.fill_(1.0)
                else:
                    label.fill_(0.0)

However, when I was debugging, I found that the network outputted [[[[nan…, …]]]] when the assertion occurred. It's caused by one of the conv layers; I put a screenshot and the code below:


As you can see, "out" is the output from that conv layer, and the input of that layer is "x", which is the image normalized to the range [-1, 1].
The code of the conv layer is:

        self.in_convs = nn.Sequential(nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
                                      nn.BatchNorm2d(32),
                                      nn.ReLU(inplace=True),
                                      nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1, bias=False),
                                      nn.BatchNorm2d(64),
                                      nn.ReLU(inplace=True))

I don't know what is going on, as I think my code is right, and before I changed my CPU and motherboard everything worked fine. Thank you so much.


You are right. The check is for the input, not the target, so I was blind.
The NaNs would also explain the failure of the check, so you would need to debug where these NaNs are coming from.
Anomaly detection is useful in case the NaNs are created in the backward pass and might help isolate the issue. Also check the inputs to the model for invalid values via torch.isfinite as a sanity check.
If this doesn't help, check the loss for NaNs as well as all parameters of the model to further narrow down which operation creates these values.
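A minimal sketch of these checks (the loop as well as the model, criterion, and optimizer names are placeholders for your own training code):

import torch

# report the backward op that produced NaNs
torch.autograd.set_detect_anomaly(True)

for data, target in loader:
    # sanity-check the inputs before the forward pass
    assert torch.isfinite(data).all(), "non-finite values in the input"
    output = model(data)
    loss = criterion(output, target)
    # check the loss before calling backward
    assert torch.isfinite(loss), "loss is NaN or Inf"
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()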


Thanks for your help

Thanks for your advice. I have found the bug with anomaly detection and torch.isfinite.
It's caused by cv2.resize(). I use bilinear interpolation to resize the input target. The range of the input to cv2.resize() is normal, but after the resize operation the values become extremely large, beyond the range of float32, so torch.from_numpy() returns nan or inf.
If I replace bilinear with another interpolation mode, or use another resize method (transform.resize from skimage), everything runs smoothly. Thank you very much.
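For reference, a sketch of the two workarounds (the input array and output size are just placeholders here):

import cv2
import numpy as np
from skimage import transform

depth_img = np.random.rand(512, 512).astype(np.float64) * 10.0  # placeholder depth image

# workaround 1: a different OpenCV interpolation mode instead of bilinear
resized = cv2.resize(depth_img, (256, 256), interpolation=cv2.INTER_NEAREST)

# workaround 2: skimage's resize instead of cv2.resize (order=1 is bilinear)
resized = transform.resize(depth_img, (256, 256), order=1, preserve_range=True)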


That’s a bit strange. Do you know why OpenCV’s resize operation is “overshooting” some values? I would assume that the bilinear interpolation creates values “between” already existing ones.

I don't know. I cannot find any references related to this issue.
I call isfinite(np.max(depth_img)) before and after cv2.resize(): before the call the value is within the expected range, and after the call it is inf or -inf when the error occurs. It's so strange. As I said, my code ran smoothly before I changed the CPU and the motherboard, so I think it's a problem with the CPU or motherboard.
The stranger thing is that the error doesn't occur at the first epoch but only later. God damn, the bug seems to have intelligence.
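Concretely, the check looks roughly like this (depth_img is the depth image loaded in my dataset, and the output size is just an example):

import cv2
import numpy as np

print(np.isfinite(np.max(depth_img)))  # True, value within the expected range
depth_im = cv2.resize(depth_img, (256, 256), interpolation=cv2.INTER_LINEAR)
print(np.isfinite(np.max(depth_im)))   # False (inf / -inf) when the error occurs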

This seems to be a tricky issue indeed.
Since you are seeing this bug after changing the CPU and it happens inside OpenCV operations, you could check if a BLAS library might be causing the issue.
I don't know which operations the resize method is using exactly, but I assume numpy is involved and then maybe MKL.
As a quick test you could run your script with:

MKL_DEBUG_CPU_TYPE=5

to force MKL to use AVX2 on AMD CPUs and check if this issue is still visible.
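For example, prefixed to the command that starts the training (the script name here is just an example):

MKL_DEBUG_CPU_TYPE=5 python train.py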


Adding MKL_DEBUG_CPU_TYPE=5 to my script seems to solve the issue. I need to run more epochs to validate it. LOL, thank you so much.

Cool that this workaround seems to work. I'm also trying to isolate this issue a bit more, as I've already seen it in the past couple of weeks, so could you post the resize operation which created the NaNs?
Were you able to run this op in a loop until these invalid values were returned on the AMD CPU?

The code that creates this error is:

depth_im = cv2.resize(depth_img, (self.output_size, self.output_size), interpolation=cv2.INTER_LINEAR)

I will try to run this op in a loop and see whether the error happens or not.

I put the op into a loop, and it does reproduce the problem. I used the following code:

import cv2
import numpy as np
import torch

from operate import check_isfinite  # logging helper from my operate.py


def main():
    while True:
        # random uint8-style image, converted to float64 (np.float) before resizing
        in_img = torch.randint(0, 255, [512, 512, 3])
        output = cv2.resize(np.asarray(in_img, dtype=np.float), (256, 256), interpolation=cv2.INTER_LINEAR)
        output = torch.from_numpy(output)
        max_output = torch.max(output)
        check_isfinite("max_output", max_output)
        assert torch.isfinite(max_output)


if __name__ == "__main__":
    main()

I also found that the error occurs randomly even though I added torch.random.manual_seed(0). It's so strange.

and part of the output is:
params name: max_output, is_finite: True
tensor(247.5000, dtype=torch.float64)
params name: max_output, is_finite: True
tensor(245., dtype=torch.float64)
params name: max_output, is_finite: True
tensor(inf, dtype=torch.float64)
params name: max_output, is_finite: False
Traceback (most recent call last):
File “/home/jhyan/Scripts/PYTHON_PROJECT/FPN/test_opencv.py”, line 17, in <module>
main()
File “/home/jhyan/Scripts/PYTHON_PROJECT/FPN/test_opencv.py”, line 13, in main
assert torch.isfinite(max_output)
AssertionError

Thanks for the test. Is this error raised without the MKL env var, with it, or in both cases?

I have to say in both cases. Last time I ran my program with the MKL env var it worked, but today the error appeared again.

If I change the float type to np.float32 (the original is np.float), it works. I also tried np.float64, and the error still occurs.
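That is, the only change in the reproduction script above is the dtype:

output = cv2.resize(np.asarray(in_img, dtype=np.float32), (256, 256), interpolation=cv2.INTER_LINEAR)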

It's a problem with the CPU. I have already requested a return.

Hi ptrblck,

I am also getting this error in the binary cross entropy criterion. I have checked that the outputs and targets are all between 0 and 1, which they are. I have also looked for NaN values and cannot find any. Is there anything else that can cause this?

cheers,

chaslie

You could rerun the code via CUDA_LAUNCH_BLOCKING=1 python script.py args and make sure that the right line of code is shown, i.e. the one which raises the issue.
If you still see that the criterion is raising it, add assert statements before calculating the loss and check the values of the model output and the targets.
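A sketch of such checks right before the loss computation (output, target, and criterion stand for your own variables):

# everything fed to BCE must be finite and inside [0, 1]
assert torch.isfinite(output).all(), "model output contains NaN/Inf"
assert ((output >= 0) & (output <= 1)).all(), "model output outside [0, 1]"
assert ((target >= 0) & (target <= 1)).all(), "target outside [0, 1]"
loss = criterion(output, target)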

Hi ptrblck,

thanks for this. I think I may be on to a winner: something was delivering a NaN to the BCE term. I think I may have solved it and will update with the solution in a day or so (depending on whether the solution works :smiley: )

thanks again for your help,

chaslie