RuntimeError CUDA error: device-side assert triggered

shirui-japina · September 18, 2019, 6:48pm

I’m training my model, and during the third epoch I got RuntimeError: CUDA error: device-side assert triggered.
Here is the code where the error occurred.

# epoch loop
    for i, (batch_z_16, batch_z_32, batch_z_48, batch_label) in enumerate(data_loader_validation):
        batch_z_16 = batch_z_16.to(device=device, dtype=torch.float) # Exception here

I also got the following messages in the terminal (using VS Code):

C:/w/1/s/tmp_conda_3.6_035809/conda/conda-bld/pytorch_1556683229598/work/aten/src/THCUNN/BCECriterion.cu:57: block: [0,0,0], thread: [7,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
C:/w/1/s/tmp_conda_3.6_035809/conda/conda-bld/pytorch_1556683229598/work/aten/src/THCUNN/BCECriterion.cu:57: block: [0,0,0], thread: [12,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
C:/w/1/s/tmp_conda_3.6_035809/conda/conda-bld/pytorch_1556683229598/work/aten/src/THCUNN/BCECriterion.cu:57: block: [0,0,0], thread: [15,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
C:/w/1/s/tmp_conda_3.6_035809/conda/conda-bld/pytorch_1556683229598/work/aten/src/THCUNN/BCECriterion.cu:57: block: [0,0,0], thread: [27,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
C:/w/1/s/tmp_conda_3.6_035809/conda/conda-bld/pytorch_1556683229598/work/aten/src/THCUNN/BCECriterion.cu:57: block: [0,0,0], thread: [30,0,0] Assertion `*input >= 0. && *input <= 1.` failed.

My questions are:

Why doesn’t it occur at the same epoch every time, and why doesn’t it occur in the first epoch?
(I tried several times: it occurs at different epochs, and sometimes it doesn’t occur at all.)
It is a device-side error, so can I fix it as a user? (What does device-side mean?)
What exactly is it? A problem with CUDA, with PyTorch, or with my Python code?
How can I make sure it never happens?

albanD · September 18, 2019, 7:30pm

Hi,

It looks like an error caused by some of the inputs to your BCECriterion being invalid. This may not happen in every epoch if, for example, you drop the last partial batch, or if you generate these inputs on the fly and only some of them are wrong.
It’s a device-side assertion: a check running on the GPU reports that the input you gave does not satisfy some condition. So yes, you can fix it by giving correct inputs to that function.
This kind of error is raised when a CUDA kernel detects a problem. CUDA is asynchronous, so unless you run your code with the CUDA_LAUNCH_BLOCKING=1 environment variable set, the Python stack trace may point to the wrong line. Unfortunately, because of how CUDA works, we can’t make these errors much more user-friendly.
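As a minimal sketch, the variable can be set from Python itself. This assumes the lines sit at the very top of the training script, before any CUDA work has been done. Setting it before importing torch is the safest place.

```python
import os

# Assumption: this runs before any CUDA call is made, ideally before
# `import torch`, so that kernel launches become synchronous and the
# Python stack trace points at the call that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import and use PyTorch only after the variable is set
```

Synchronous launches slow training down, so this is meant for debugging runs only.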
The error is in the file BCECriterion.cu, so it comes from a call to BCELoss (or a criterion that uses it). The failed assertion is *input >= 0. && *input <= 1., so you should check that your input values are in [0, 1], as the documentation requires.
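One way to catch this before it reaches the GPU kernel is to validate the predictions on the Python side. The sketch below is illustrative only: check_bce_input is a hypothetical helper, and the tensors are made-up values rather than anything from the original model. It also checks for NaN, since a NaN fails the same comparison and may only show up in later epochs if training diverges.

```python
import torch
import torch.nn as nn


def check_bce_input(pred: torch.Tensor) -> None:
    """Hypothetical helper: raise a readable error for invalid BCELoss input."""
    if torch.isnan(pred).any():
        raise ValueError("prediction contains NaN")
    lo, hi = pred.min().item(), pred.max().item()
    if lo < 0.0 or hi > 1.0:
        raise ValueError(f"prediction outside [0, 1]: min={lo}, max={hi}")


# Made-up example data: raw model outputs (logits), not probabilities.
logits = torch.tensor([2.5, -1.0, 0.3])
target = torch.tensor([1.0, 0.0, 1.0])

# Option 1: squash the logits into (0, 1) before calling BCELoss.
probs = torch.sigmoid(logits)
check_bce_input(probs)
loss = nn.BCELoss()(probs, target)

# Option 2: let the loss apply the sigmoid itself.
loss_from_logits = nn.BCEWithLogitsLoss()(logits, target)
```

Calling check_bce_input(logits) on the raw outputs raises a ValueError with the offending range, which is easier to act on than the device-side assert.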