CUDA error always at the same point


I’m running a script and, always at iteration 30 of the second epoch (which has nothing special about it), I receive the following CUDA error:

File "/home/", line 86, in forward
    print("X before mask: ", x, flush=True)
  File "/home/envs/myenv/lib/python3.8/site-packages/torch/", line 203, in __repr__
    return torch._tensor_str._str(self)
  File "/home/envs/myenv/lib/python3.8/site-packages/torch/", line 406, in _str
    return _str_intern(self)
  File "/home/envs/myenv/lib/python3.8/site-packages/torch/", line 381, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/envs/myenv/lib/python3.8/site-packages/torch/", line 242, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/envs/myenv/lib/python3.8/site-packages/torch/", line 90, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) &
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

It’s absolutely always at the same point of the computation. While debugging, I tried to print the element that raises the error, but I cannot even print it; I can print its shape, though. If I use set_detect_anomaly, I also get:

/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/DistributionTemplates.h:592: operator(): block: [118,0,0], thread: [449,0,0] Assertion `0 <= p4 && p4 <= 1` failed.

Any idea? I’m absolutely lost; I don’t know what it can be. Obviously there must be some kind of error that I’m not able to find.


I would recommend rerunning your script with CUDA_LAUNCH_BLOCKING=1, as described in the error message, to make sure the stacktrace points to the actual failure.
Based on the first stacktrace it seems your masked_select operation is wrong and raises the assert, while the second stacktrace would point to an invalid argument in a sampling method.
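For completeness, the variable can be set inline on the command line (here `train.py` is a placeholder; substitute your own entry point):

```shell
# Force synchronous kernel launches so the Python stacktrace points at
# the op that actually triggered the device-side assert.
# "train.py" is a placeholder for your own script name.
CUDA_LAUNCH_BLOCKING=1 python train.py
```

Note that synchronous launches slow execution down considerably, so use this only while debugging.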


I’ve done it, and now I receive an error in a line of my script, but the only hint I get is:

File "/home/", line 171, in sample_bernouilli
    sample_prob = torch.bernoulli(prob_array)
RuntimeError: CUDA error: device-side assert triggered

It is not indicated whether the array is incompatible with the function or not. I assume that must be the case, right? Something must be wrong with prob_array. But why is that not reported, then?

Thanks again!

prob_array most likely contains invalid values, as seen here:

# works
prob_array = torch.tensor([0.0, 0.5, 1.0], device='cuda')
sample_prob = torch.bernoulli(prob_array)
# tensor([0., 1., 1.], device='cuda:0')

# fails
prob_array = torch.tensor([0.0, 0.5, 1.1], device='cuda')
sample_prob = torch.bernoulli(prob_array)
# ../aten/src/ATen/native/cuda/DistributionTemplates.h:598: operator(): block: [0,0,0], thread: [0,0,0] Assertion `0 <= p3 && p3 <= 1` failed.

Print this tensor and make sure you are passing valid probabilities in the range [0, 1].
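A minimal sketch of such a check (here `prob_array` is a stand-in tensor with one bad entry; doing the check on the CPU also works after a device-side assert has already poisoned the CUDA context, in which case you would move a fresh copy of the data):

```python
import torch

# Stand-in for the tensor passed to torch.bernoulli, with one invalid entry.
prob_array = torch.tensor([0.0, 0.5, 1.1])

# Valid probabilities must be finite and lie in [0, 1].
bad = ~(torch.isfinite(prob_array)
        & (prob_array >= 0.0)
        & (prob_array <= 1.0))

if bad.any():
    idx = bad.nonzero(as_tuple=True)[0]
    print("invalid probabilities at indices:", idx.tolist())
    print("offending values:", prob_array[bad].tolist())
```

If the invalid values turn out to be NaNs or entries just outside [0, 1] from numerical error, track down where they originate (e.g. the op producing the probabilities) rather than clamping them with torch.clamp, which only hides the symptom.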