In binary_cross_entropy, RuntimeError: CUDA error: device-side assert triggered

chaslie · June 20, 2021, 3:51pm

hi,

Hoping someone can help. In a GAN, I get the error:

C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [3,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [4,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [5,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [6,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [7,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [8,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [9,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
line 2762, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered

I am using nn.Sigmoid() within the discriminator to make sure that its output, which is the input to the loss, is between 0 and 1, and torch.nn.BCELoss() as the loss function. Can anyone help solve the error, please?
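For anyone debugging the same assert: the check `input_val >= zero && input_val <= one` fails for values outside [0, 1], and it also fails for NaN, because every comparison with NaN is false. A sigmoid maps NaN to NaN, so a NaN coming from diverged weights could still reach BCELoss even with nn.Sigmoid() in place. Below is a minimal sketch of a pre-loss check, assuming that is the failure mode; the helper name check_bce_input is hypothetical and not part of PyTorch.

```python
import torch


def check_bce_input(probs: torch.Tensor) -> None:
    """Raise a readable error if probs cannot be passed to nn.BCELoss.

    Hypothetical debugging helper, not a PyTorch function.
    """
    if not torch.isfinite(probs).all():
        raise ValueError("discriminator output contains NaN or Inf")
    if probs.min() < 0 or probs.max() > 1:
        raise ValueError("discriminator output lies outside [0, 1]")


# Sigmoid keeps finite logits inside [0, 1], so this passes silently.
check_bce_input(torch.sigmoid(torch.tensor([-50.0, 0.0, 50.0])))

# Sigmoid passes NaN through unchanged, so this input would trip the assert.
bad = torch.sigmoid(torch.tensor([0.0, float("nan")]))
try:
    check_bce_input(bad)
except ValueError as err:
    print(err)
```

Calling the helper on the discriminator output right before the criterion call would show whether the values are already bad at that point. Running the failing step once on the CPU, or with the environment variable CUDA_LAUNCH_BLOCKING=1 set, usually gives a clearer stack trace than the asynchronous device-side assert.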

cheers,

chaslie

ptrblck · June 21, 2021, 1:50am

In case you are not using the latest release (1.9.0), could you update PyTorch and rerun the script?
If you are still seeing the issue, could you post an executable code snippet reproducing this error as well as the output of python -m torch.utils.collect_env?

chaslie · June 21, 2021, 9:00am

Thanks ptrblck, updating PyTorch seems to have solved the problem. Any idea what was causing the error and how it was fixed in the latest version of PyTorch?

ptrblck · June 21, 2021, 7:38pm

No, I don’t remember this exact issue in the last version, but I also don’t know which release you were using before updating.

chaslie · June 22, 2021, 10:28am

fair point, i was on 1.8.3 i think???

chaslie · June 22, 2021, 10:31am

Hi Ptrblck,

I have run torch.utils.collect_env; I hope it makes more sense to you than to me.

PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.8 (64-bit runtime)
Python platform: Windows-10-10.0.19041-SP0
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] numpydoc==1.1.0
[pip3] torch==1.9.0
[pip3] torchaudio==0.9.0
[pip3] torchio==0.18.25
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.10.0
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.2.89              h74a9793_1
[conda] mkl                       2020.2                      256
[conda] mkl-service               2.3.0            py38h196d8e1_0
[conda] mkl_fft                   1.3.0            py38h46781fe_0
[conda] mkl_random                1.1.1            py38h47e9c7a_0
[conda] numpy                     1.19.2           py38hadc3359_0
[conda] numpy-base                1.19.2           py38ha3acd2a_0
[conda] numpydoc                  1.1.0              pyhd3eb1b0_1
[conda] pytorch                   1.9.0           py3.8_cuda10.2_cudnn7_0    pytorch
[conda] torchaudio                0.9.0                      py38    pytorch
[conda] torchio                   0.18.25                  pypi_0    pypi
[conda] torchsummary              1.5.1                    pypi_0    pypi
[conda] torchvision               0.10.0               py38_cu102    pytorch

After 9 epochs I get the crash with the following:

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([10, 700, 4, 4], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(700, 1024, kernel_size=[4, 4], padding=[1, 1], stride=[2, 2], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams 
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [2, 2, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 00000150CFDF7190
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 10, 700, 4, 4, 
    strideA = 11200, 16, 4, 1, 
output: TensorDescriptor 00000150CFDF6400
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 10, 1024, 2, 2, 
    strideA = 4096, 4, 2, 1, 
weight: FilterDescriptor 00000150CF7DAB60
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 1024, 700, 4, 4, 
Pointer addresses: 
    input: 0000001C68600000
    output: 0000001C38DB0000
    weight: 0000001CA2000000
Additional pointer addresses: 
    grad_output: 0000001C38DB0000
    grad_weight: 0000001CA2000000
Backward filter algorithm: 1

ptrblck · June 22, 2021, 5:11pm

I assume the created code snippet in the error message runs fine or does it also crash?
Assuming the former, could you post an executable code snippet to reproduce this issue as well as which GPU you are using?

chaslie · July 3, 2021, 7:52am

hi PtrBlck,

I am using a Titan RTX GPU.

It seems changing the learning rate only delays the onset.
The executable code for a VAE-GAN is:

            b_size = real_cpu.size(0)
            label_r = torch.full((b_size,), real_label, dtype=torch.float, device=device)
            label_f = torch.full((b_size,), fake_label, dtype=torch.float, device=device)
            # Forward pass real batch through D
            output = netD(real_cpu).view(-1)
            # Calculate loss on all-real batch
            errD_real = criterion(output, label_r)
            # D_x = output.mean().item()


            # label.fill_(fake_label)
            loss_G1, out_G1 = loss_fn_G_I(netG, real_cpu, device)
            output1 = netD(out_G1.detach()).view(-1)
            errD_G_real = criterion(output1, label_f)

            ## Train with all fake based on noise
            noise = torch.randn(b_size, nz, device=device)
            fake = netG.D2_Decoder(noise)
            # label.fill_(fake_label)
            output2 = netD(fake.detach()).view(-1)
            errD_fake = criterion(output2, label_f)


            errD = errD_real + errD_fake + errD_G_real
            # Calculate gradients for D in backward pass
            optimizerD.zero_grad()
            errD.backward(retain_graph=True)
            optimizerD.step()

            label_r2 = torch.full((b_size,), real_label, dtype=torch.float, device=device)
            fake2 = netG.D2_Decoder(noise)

            # with torch.autograd.set_detect_anomaly(True):
            #### now to work on the generator
            #### use just the decoder of the VAE first with fake
           
            optimizerG_D.zero_grad()
            # label.fill_(real_label)
            output4 = netD(fake2).view(-1)
            errG_F = criterion(output4, label_r2)
            output5 = netD(out_G1).view(-1)
            errG_R = criterion(output5, label_r2)
            err_G=errG_F+errG_R
            err_G.backward(retain_graph=True)
            optimizerG_D.step()

            ### now we operate on the encoder part of the VAE
            optimizerG_E.zero_grad()
            loss_G2, out_G2 = loss_fn_G_I(netG, real_cpu, device)
            loss_G2.backward(retain_graph=True)
            optimizerG_E.step()

I am at my wits' end as to what is causing this. The dataset used is celebrity faces.

ptrblck · July 4, 2021, 10:53pm

Thanks for the update! The code is unfortunately not executable, so I cannot try to reproduce the issue.
Could you please update it and ping me here again?

chaslie · July 5, 2021, 9:35am

hi ptrblck,

It seems that this is a learning rate issue. If I set the LR very high, e.g. 1e-4, then the error occurs; however, if I set the LR to 2.5e-6, the model will run through to 60 epochs.

What is the best way of sending you the code?

chaslie

ptrblck · July 6, 2021, 8:36am

Post or edit your code here and make sure others can run it in order to reproduce it.
I.e. check that all functions are defined and in case data is used, create random tensors, if possible.

chaslie · November 1, 2021, 12:09pm

I have finally got to the bottom of this problem. If you are seeing

C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [3,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [4,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [5,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [6,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [7,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [8,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:/cb/pytorch_1000000000000/work/aten/src/ATen/native/cuda/Loss.cu:102: block: [0,0,0], thread: [9,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
line 2762, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered

Then check that you haven’t got backward(retain_graph=True) active. If you have, then revise the training script to get rid of it. It seems that the gradients are stacking up and eventually they will “blow up”.
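As an illustration of a training step that does not need retain_graph=True, here is a minimal sketch. It assumes a plain GAN rather than the VAE-GAN above; the tiny netD and netG are stand-ins, since the real networks are not shown in this thread. The fake batch is detached for the discriminator update, and the generator update runs a fresh forward pass through the discriminator, so each graph is backpropagated only once.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Tiny stand-in networks (assumption: the real netD / netG are not shown).
netD = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
netG = nn.Linear(4, 8)
criterion = nn.BCELoss()
optimizerD = torch.optim.Adam(netD.parameters(), lr=1e-4)
optimizerG = torch.optim.Adam(netG.parameters(), lr=1e-4)

real = torch.randn(16, 8)      # random tensors in place of real images
noise = torch.randn(16, 4)
label_r = torch.ones(16)
label_f = torch.zeros(16)

# Discriminator step: detach the fake batch so backward stops before netG.
fake = netG(noise)
optimizerD.zero_grad()
errD_real = criterion(netD(real).view(-1), label_r)
errD_fake = criterion(netD(fake.detach()).view(-1), label_f)
errD = errD_real + errD_fake
errD.backward()                # no retain_graph needed
optimizerD.step()

# Generator step: fresh forward pass through the updated discriminator.
optimizerG.zero_grad()
errG = criterion(netD(fake).view(-1), label_r)
errG.backward()                # this graph is backpropagated exactly once
optimizerG.step()
```

Whether this restructuring alone removes the NaN or out-of-range values depends on the model; it only shows that the usual discriminator and generator updates can be written without retaining the graph.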

hao_gao · July 22, 2023, 3:51pm

My error is the same as yours. I updated PyTorch, but that didn’t fix it. I would like to ask: does backward(retain_graph=True) refer to the loss.backward() call?

ptrblck · July 22, 2023, 4:00pm

Updating PyTorch won’t fix a valid indexing error. In the previous post the loss function fails as a wrong target index was used, which should be fixed in your code and is unrelated to the PyTorch version.

bagus · June 25, 2024, 2:53am

I don’t have retain_graph=True, but I still get this error. I guess that I got this error after updating NVIDIA CUDA, but I am not sure.

ptrblck · June 25, 2024, 12:18pm

Updating CUDA won’t change your code and thus won’t raise valid assert checks.

bagus · June 25, 2024, 9:28pm

@ptrblck thanks, I also don’t think that is the case, but (as far as I remember) it happened after an apt update and upgrade on my Ubuntu system. Anyway, I solved it by changing BCELoss to BCEWithLogitsLoss as suggested here: CUDA assertion error binary_cross_entropy loss · Issue #9 · NVIDIA/pix2pixHD · GitHub.
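A minimal sketch of that change, assuming the discriminator's final nn.Sigmoid() is removed so that it returns raw logits (the values below are made up for illustration). nn.BCEWithLogitsLoss applies the sigmoid internally, so for ordinary logits it gives the same loss as nn.Sigmoid() followed by nn.BCELoss().

```python
import torch
from torch import nn

# Raw discriminator outputs (logits), i.e. without a final sigmoid layer.
logits = torch.tensor([-3.0, 0.0, 2.5])
target = torch.tensor([0.0, 1.0, 1.0])

# Original setup from the thread: sigmoid in the model, then BCELoss.
loss_probs = nn.BCELoss()(torch.sigmoid(logits), target)

# Replacement: pass the logits directly to BCEWithLogitsLoss.
loss_logits = nn.BCEWithLogitsLoss()(logits, target)

print(loss_probs.item(), loss_logits.item())
```

Note that BCEWithLogitsLoss has no range check on its input, so a NaN logit gives a NaN loss instead of an assert. If the underlying cause is a diverging model, the switch may hide the symptom rather than cure it, so it is still worth checking the loss for NaN.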