RuntimeError: after reduction step 2: cudaErrorAssert: device-side assert triggered

chaslie · October 21, 2020, 10:24am

Help!!!

I have a problem during training. I get the error message in the title when using:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

The error is triggered when the following line goes to NaN:

BCE = F.binary_cross_entropy(out,real_data_I,size_average=False)

I have tried binary cross entropy with logits and it gives the same error. Since the error happens at random times, I deduced that it has something to do with the input data; however, I have looked at the input data, and it is not the same input that triggers the error each time.

Does anyone have any ideas?

Chaslie

pchandrasekaran · October 21, 2020, 2:02pm

Start off by checking if you have any NaNs in the input. The fact that another loss function causes the error as well suggests that there is likely a problem with the input.
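A minimal sketch of such a check, assuming the tensors are named `out` and `real_data_I` as in the first post. The helper name `describe_bce_input` is made up for illustration. It also looks at the value range, since `F.binary_cross_entropy` expects its input to lie in [0, 1], and as far as I know an out-of-range or NaN value is what trips the device-side assert on the GPU.

```python
import torch

def describe_bce_input(name, t):
    """Report problems that would break binary_cross_entropy (hypothetical helper)."""
    problems = []
    if torch.isnan(t).any():
        problems.append("contains NaN")
    if torch.isinf(t).any():
        problems.append("contains Inf")
    # Range check on the finite entries only, so NaN/Inf are not reported twice.
    finite = t[torch.isfinite(t)]
    if finite.numel() > 0 and (finite.min() < 0 or finite.max() > 1):
        problems.append("has values outside [0, 1]")
    return f"{name}: " + (", ".join(problems) if problems else "ok")
```

Calling it on both `out` and `real_data_I` right before the loss line should show which tensor goes bad first, for example `print(describe_bce_input("out", out))`.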

chaslie · October 21, 2020, 2:10pm

Pchandrasekaran,

I have checked the inputs and there are no NaNs in the input.

I have re-run the model after removing the sigmoid and switching to the binary cross entropy with logits loss function, and it seems to be running at the moment.

The model consists of two CVAEs, and both run and work well with the input data, whether with BCE = F.binary_cross_entropy(out,real_data_I,size_average=False) plus a sigmoid or with BCE_withLogits and no sigmoid…

Update: the model has crashed with **RuntimeError: after reduction step 2: cudaErrorAssert: device-side assert triggered** after 5 epochs. This is using F.binary_cross_entropy_with_logits.

chaslie

chaslie · October 21, 2020, 4:31pm

I have also set the batch size to 1 and shuffle to False in the dataloader and re-run the model. It failed at increments 251, 286 and 145, which suggests it's not a problem with the input data.

ptrblck · October 23, 2020, 12:02am

Which PyTorch version are you using?
If you are using 1.5, could you please update, as assert statements in 1.5.0 were not working properly.
Also, could you post the complete stack trace you are seeing?

chaslie · October 23, 2020, 7:58am

Hi Ptrblck,

How are you? I hope you are avoiding the worst that covid is throwing at you.

I am using version 1.4.0.

Changing the loss function from F.binary_cross_entropy_with_logits to torch.nn.BCELoss and putting the sigmoid function at the end of the network seems to have resolved the error.

I think there may be an issue with F.binary_cross_entropy_with_logits and binary_cross_entropy?
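One way to test that suspicion is to compare the two functions on the same well-behaved data. A small sketch, using random tensors as stand-ins (these are not the CVAE outputs, and the shapes are arbitrary), with `reduction="sum"` as the non-deprecated spelling of `size_average=False`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 3)   # stand-in for the network output before the sigmoid
target = torch.rand(8, 3)    # stand-in targets in [0, 1]

# Sigmoid applied by hand, then the plain BCE.
loss_probs = F.binary_cross_entropy(torch.sigmoid(logits), target, reduction="sum")
# Sigmoid folded into the loss.
loss_logits = F.binary_cross_entropy_with_logits(logits, target, reduction="sum")

print(loss_probs.item(), loss_logits.item())
```

If the two numbers agree on data like this, the loss functions themselves are probably not at fault, and the more likely culprit is NaN or out-of-range values reaching them.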

I have also decreased the learning rate by an order of magnitude, and the model is now running.
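If lowering the learning rate helps, one plausible reading is that the weights were occasionally blowing up during training. A hedged sketch of a training step that fails fast on a non-finite loss and clips gradients. `guarded_step`, `max_norm` and the way the model is called are assumptions for illustration, not the original training code:

```python
import torch
import torch.nn.functional as F

def guarded_step(model, optimizer, x, target, max_norm=1.0):
    """One training step that raises instead of propagating NaN/Inf (hypothetical)."""
    optimizer.zero_grad()
    logits = model(x)
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction="sum")
    if not torch.isfinite(loss):
        # Stop here with a readable Python error rather than a later CUDA assert.
        raise FloatingPointError(f"non-finite loss: {loss.item()}")
    loss.backward()
    # Rescale gradients so their total norm does not exceed max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```

The value of `max_norm` is arbitrary here and would need tuning. Clipping only limits the size of each update, so it complements a lower learning rate and does not replace it.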

I will post the stack trace tomorrow once the model has finished training.

chaslie