I’m training my model, and at the third epoch I got `RuntimeError: CUDA error: device-side assert triggered`.
Here is the code where the error occurred:
# epoch loop
for i, (batch_z_16, batch_z_32, batch_z_48, batch_label) in enumerate(data_loader_validation):
    batch_z_16 = batch_z_16.to(device=device, dtype=torch.float)  # exception raised here
I also got the following messages in the terminal (using VS Code):
C:/w/1/s/tmp_conda_3.6_035809/conda/conda-bld/pytorch_1556683229598/work/aten/src/THCUNN/BCECriterion.cu:57: block: [0,0,0], thread: [7,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
C:/w/1/s/tmp_conda_3.6_035809/conda/conda-bld/pytorch_1556683229598/work/aten/src/THCUNN/BCECriterion.cu:57: block: [0,0,0], thread: [12,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
C:/w/1/s/tmp_conda_3.6_035809/conda/conda-bld/pytorch_1556683229598/work/aten/src/THCUNN/BCECriterion.cu:57: block: [0,0,0], thread: [15,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
C:/w/1/s/tmp_conda_3.6_035809/conda/conda-bld/pytorch_1556683229598/work/aten/src/THCUNN/BCECriterion.cu:57: block: [0,0,0], thread: [27,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
C:/w/1/s/tmp_conda_3.6_035809/conda/conda-bld/pytorch_1556683229598/work/aten/src/THCUNN/BCECriterion.cu:57: block: [0,0,0], thread: [30,0,0] Assertion `*input >= 0. && *input <= 1.` failed.
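The assertion text in these messages, `*input >= 0. && *input <= 1.` from `BCECriterion.cu`, suggests that some value passed to the BCE loss was outside [0, 1]. A minimal sketch of the same range check done on the host side, assuming the tensor given to the loss is called `pred` (a hypothetical name, since the loss call is not part of the snippet above):

```python
import torch


def check_bce_input(pred: torch.Tensor) -> bool:
    """Return True only if every element of pred lies within [0, 1]."""
    # NaN compares False against both bounds, so NaN values fail this check too.
    return bool(((pred >= 0) & (pred <= 1)).all())


# Illustrative values only, not taken from the actual training run.
valid = torch.tensor([0.0, 0.3, 1.0])
invalid = torch.tensor([0.2, 1.5, float("nan")])

print(check_bce_input(valid))
print(check_bce_input(invalid))
```

Calling a check like this right before the loss would show which batch produces out-of-range values. CUDA errors are reported asynchronously, so the Python line where the exception surfaces (here the `.to(...)` call) is not necessarily the line that caused it; running with the environment variable `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous and usually gives a more accurate traceback.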
My questions are:
- Why doesn’t it occur at the same epoch every time, and why doesn’t it occur in the first epoch? (I tried several times and found that it occurs at different epochs, and sometimes it doesn’t occur at all.)
- It is a device-side error, so can I fix it as a user? (What does "device-side" mean?)
- What exactly is it? Is it a problem with CUDA, PyTorch, or my Python code?
- How can I make sure it never happens?