RuntimeError: CUDA error: device-side assert triggered - 1D CVAE

Hi,

Using a 1D CVAE with BCELoss gives me the following output (epoch, total number of batches, loss):

37 6912 tensor(318.8038, device='cuda:0', grad_fn=<NegBackward>)
38 6912 tensor(348.9748, device='cuda:0', grad_fn=<NegBackward>)
Traceback (most recent call last):
    data = data.cuda()

RuntimeError: CUDA error: device-side assert triggered

Does anyone know what is causing this?

The data loader statement is:

train_dataset2 = TensorDataset(X_train_2, y_train)
train_loader2 = DataLoader(train_dataset2, batch_size=BATCHSIZE, shuffle=True)

where X_train_2 is the input 1D vector and y_train is a linear array running from 1 to x, where x is the length of the X_train_2 array.
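Since the same data tensor later serves as both the input and the BCELoss target, its values need to lie in [0, 1]; a quick sanity check along these lines (a sketch, assuming X_train_2 is the float tensor used above) can rule out bad training data:

import torch

# BCELoss expects values in [0, 1]; check the training tensor up front
print(X_train_2.dtype, X_train_2.min().item(), X_train_2.max().item())
assert torch.isfinite(X_train_2).all(), "non-finite values in X_train_2"
assert X_train_2.min() >= 0 and X_train_2.max() <= 1, "values outside [0, 1]"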

Chaslie

Could you run the code with CUDA_LAUNCH_BLOCKING=1 python script.py args and post the stack trace here?
Also, is your code working fine on the CPU?

Hi ptrblck,

How are you?

I am currently running the model with:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

I haven't tried to run it on the CPU. I will copy the stack trace when it fails at epoch 39.

This is related to my other post about NaN values when using torch.nn.BCEWithLogitsLoss at epoch 39…

cheers for now,

chaslie

Hi ptrblck,

The model has completed its run with:

## for debugging - remove when debugged
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

and the result is:

31 6912 tensor(298.7733, device='cuda:0', grad_fn=<NegBackward>)
32 6912 tensor(287.6319, device='cuda:0', grad_fn=<NegBackward>)
33 6912 tensor(345.7334, device='cuda:0', grad_fn=<NegBackward>)
34 6912 tensor(299.0026, device='cuda:0', grad_fn=<NegBackward>)
35 6912 tensor(336.5276, device='cuda:0', grad_fn=<NegBackward>)
36 6912 tensor(304.2394, device='cuda:0', grad_fn=<NegBackward>)
37 6912 tensor(274.8873, device='cuda:0', grad_fn=<NegBackward>)
38 6912 tensor(328.6809, device='cuda:0', grad_fn=<NegBackward>)
Traceback (most recent call last):

    logvar,mean,loss,out,logqz_x,logpz,logpx_z,z2 = loss_fn(model, data)

 XXXXXXX line 286, in loss_fn
    logpx_z=-torch.sum(BCE,[1,2],keepdim=False)

RuntimeError: CUDA error: device-side assert triggered

The function containing the code in question is:

def loss_fn(model, data):
    mean, logvar = model.encode(data)
    z2 = model.reparm(mean, logvar)
    out = model.decode(z2)

    # size_average/reduce are deprecated; reduce=False makes the loss element-wise,
    # so the reduction='sum' argument is effectively ignored here
    criterion = torch.nn.BCELoss(size_average=False, reduce=False, reduction='sum')
    #criterion = torch.nn.BCEWithLogitsLoss(size_average=False, reduce=False, reduction='sum')
    BCE = criterion(out, data)
    logpx_z = -torch.sum(BCE, [1, 2], keepdim=False)  # per-sample reconstruction term
    #logpx_z = -torch.sum(BCE, 2, keepdim=True)
    logpz = log_normal_pdf(z2, torch.tensor(0.), torch.tensor(0.))
    logqz_x = log_normal_pdf(z2, mean, logvar)
    mean = logpx_z + logpz - logqz_x  # per-sample ELBO (re-uses the name "mean")

    loss = -torch.mean(mean)
    return logvar, mean, loss, out, logqz_x, logpz, logpx_z, z2
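For reference, size_average and reduce are deprecated in favour of reduction; an equivalent, non-deprecated way to get the per-sample reconstruction term inside loss_fn (a sketch assuming out and data have shape [batch, channels, length]) would be:

import torch

criterion = torch.nn.BCELoss(reduction='none')    # element-wise loss, no reduction
BCE = criterion(out, data)                        # same shape as out
logpx_z = -torch.sum(BCE, [1, 2], keepdim=False)  # one value per sample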

I really have no idea why this is failing; when I look at the output of logpz in previous epochs, nothing seems unusual.

The model is now running on the CPU:

  warnings.warn(warning.format(ret))
1 6912 tensor(627.7763, grad_fn=<NegBackward>)

regards,

chaslie

Thanks for the debugging!

Could you post the stats of BCE before this assertion is triggered?
Using the stats of the tensor (min/max/mean), we could try to rerun the offending line of code with random values until we run into the same error and can debug further.

Hi ptrblck,

How do I do that?

The model has finished running on the CPU with the following:

37 6912 tensor(284.1459, grad_fn=<NegBackward>)
38 6912 tensor(298.0830, grad_fn=<NegBackward>)
Traceback (most recent call last):

logvar,mean,loss,out,logqz_x,logpz,logpx_z,z2 = loss_fn(model, data)
BCE=criterion(out,data)
result = self.forward(*input, **kwargs)
return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)

 \site-packages\torch\nn\functional.py", line 2077, in binary_cross_entropy
    input, target, weight, reduction_enum)

RuntimeError: Assertion `x >= 0. && x <= 1.' failed. input value should be between 0~1, but got -nan(ind) at C:\w\1\s\tmp_conda_3.7_100118\conda\conda-bld\pytorch_1579082551706\work\aten\src\THNN/generic/BCECriterion.c:34

chaslie
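The CPU traceback narrows this down: the input reaching binary_cross_entropy already contains NaN values, so the decoder output (or something upstream of it) is producing non-finite numbers by epoch 38. One way to locate where the NaN first appears (a sketch reusing the names from the loss_fn posted above) would be:

import torch

# report the operation that produced a NaN/Inf during backward
torch.autograd.set_detect_anomaly(True)

def check(name, t):
    # raise as soon as a tensor stops being finite
    if not torch.isfinite(t).all():
        raise RuntimeError("{} contains NaN/Inf values".format(name))

mean, logvar = model.encode(data)
check("mean", mean)
check("logvar", logvar)
z2 = model.reparm(mean, logvar)
check("z2", z2)
out = model.decode(z2)
check("out", out)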

Add the print statements before the loss calculation via:

print(input.mean(), input.min(), input.max())
print(target.mean(), ...)

We could use these stats to create random tensors and try to run into this issue locally.
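For example, a rough local reproduction could look like this (shapes and stats are placeholders here; the real values would come from the prints above):

import torch
import torch.nn.functional as F

# stand-ins for the decoder output and the data tensor (shape is a guess)
inp = torch.rand(64, 1, 100)     # BCE input, should lie in [0, 1]
target = torch.rand(64, 1, 100)  # BCE target

bce = F.binary_cross_entropy(inp, target, reduction='none')
print(bce.mean(), bce.min(), bce.max())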

Also, could you check your numpy version via print(np.__version__), please?

hi ptrblck,

Will do, but it will take a couple of hours to run…

The numpy version is 1.18.1.

chaslie

Hi Ptrblck,

I have made a couple of changes to the code before rerunning. I have changed the logpx_z calculation to:

logpx_z=-torch.sum(BCE,[0,1,2],keepdim=False)
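Note that including dim 0 in the sum also collapses the batch dimension, so logpx_z becomes a single scalar instead of one value per sample, which changes what the final torch.mean averages over. A quick shape check (a sketch assuming BCE has shape [batch, channels, length]):

import torch

BCE = torch.rand(64, 1, 100)            # stand-in for the element-wise BCE values
print(torch.sum(BCE, [1, 2]).shape)     # torch.Size([64]) - one value per sample
print(torch.sum(BCE, [0, 1, 2]).shape)  # torch.Size([])   - a single scalar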

The model has now completed epoch 40:

40 6912 tensor(10543.6299, device='cuda:0', grad_fn=<NegBackward>) tensor(0.2380, device='cuda:0') tensor(3.9410e-06, device='cuda:0') tensor(1., device='cuda:0') tensor(0.2542, device='cuda:0', grad_fn=<MeanBackward0>) tensor(0.0054, device='cuda:0', grad_fn=<MinBackward1>) tensor(1., device='cuda:0', grad_fn=<MaxBackward1>)

Running with CUDA_LAUNCH_BLOCKING=1 yields:

line 334, in <module>
loss.backward()

\torch\tensor.py", line 195, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)

\torch\autograd\__init__.py", line 99, in backward
allow_unreachable=True)  # allow_unreachable flag

I have also dumped the full BCE output for epochs 37, 38, 39 and 40 into a pickle file.

chaslie

Could you upload these files?
Also, which PyTorch version are you using?

Hi Ptrblck,

I am currently using PyTorch 1.4.0:

import torch
print(torch.__version__)
1.4.0

Unfortunately, the BCE dump is over 3 GB and as a result is a bit unwieldy…

I am re-running, printing the min, max and mean of the output for each epoch up to 38, and then at each iteration from epoch 38 onwards (along with the output from the decoder).

I will take a screen grab of the last few iterations before the failure and copy them into the post.

Do you have any thoughts on what may be causing this?

One thing, though: it seems to be generating a negative minimum BCE value:

7 6912 tensor(13018.4346, device='cuda:0', grad_fn=<...>) tensor(0.5092, device='cuda:0', grad_fn=<...>) tensor(-0., device='cuda:0', grad_fn=<...>) tensor(1.5307, device='cuda:0', grad_fn=<...>) tensor(0.3879, device='cuda:0', grad_fn=<...>) tensor(0.1181, device='cuda:0', grad_fn=<...>) tensor(1., device='cuda:0', grad_fn=<...>)

chaslie

Thanks for the update!
Is the -0 tensor the sigmoid output of your model?

It's the BCE output. I have used the following within the model class (which I assume is applying a sigmoid function):

    def decode(self, z2, apply_sigmoid=True):
        logits = self.Decoder(z2)
        if apply_sigmoid:
            # return probabilities in [0, 1] for BCELoss
            probs = torch.sigmoid(logits)
            return probs

        # return raw logits (e.g. for BCEWithLogitsLoss)
        return logits

“z2” is the input into the decoder, which returns “out”, and “out” is compared against “data” in the BCE function…

I have tried to use BCEWithLogitsLoss as well as the function above, and that also failed at around epoch 38…

There is something strange happening in that region, but I have looked at all the outputs from epoch 36 onwards and cannot see the problem. I am beginning to think the problem is more fundamental…
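One thing that may be worth ruling out on the BCEWithLogitsLoss side: it applies the sigmoid internally, so the decoder would need to be called with apply_sigmoid=False; otherwise the sigmoid ends up being applied twice. A sketch of that variant, reusing the names from loss_fn (this is a guess at the intended usage, not the original code):

import torch

criterion = torch.nn.BCEWithLogitsLoss(reduction='none')

logits = model.decode(z2, apply_sigmoid=False)    # raw logits, no sigmoid
BCE = criterion(logits, data)                     # sigmoid applied inside the loss
logpx_z = -torch.sum(BCE, [1, 2], keepdim=False)  # one value per sample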